
From GPT-1 to GPT-5, LLMs have made tremendous progress in modeling human language. But can they go beyond that to make new discoveries and move the needle on scientific progress? We sat down with distinguished Columbia CS professor Vishal Misra to discuss this, plus why chain-of-thought reasoning works so well, what real AGI would look like, and what actually causes hallucinations.
A
Any LLM that was trained on pre-1915 physics would never have come up with a theory of relativity. Einstein had to sort of reject Newtonian physics and come up with this spacetime continuum. He completely rewrote the rules. AGI will be when we are able to create new science, new results, new math. When an AGI comes up with a theory of relativity, it has to go beyond what it has been trained on to come up with new paradigms, new science. That's my definition of AGI.
B
Vishal Misra was trying to fix a broken cricket stats page and accidentally helped spark one of AI's biggest breakthroughs. On this episode of the a16z podcast, I talk with Vishal and a16z's Martin Casado about how that moment led to retrieval-augmented generation, and how Vishal's formal models explain what large language models can and can't do. We discussed why LLMs might be hitting their limits, what real reasoning looks like, and what it would take to go beyond them. Let's get into it.
C
Martin, I know you wanted to have Vishal on. What do you find so remarkable about him and his contributions that inspired this?
D
Vishal and I actually have very similar backgrounds. We both come from networking. He's a much more accomplished networking guy than I am.
C
That's a high bar given you from.
D
And so we actually view the world in an information theoretic way. It is actually part of networking. And with all this AI stuff, there's so much work trying to create models that can help us understand how these LLMs work. And in my experience over the last three years, the ones that have most impacted my understanding and I think have been the most predictive are the ones that Vishal has come up with. He did a previous one that we're going to talk about called Matrix.
A
Is it beyond the black box? But yeah, the Matrix beyond the black box.
D
Actually, we should put this in the notes for this. But the single best talk I've ever seen on trying to understand how LLMs work is one that Vishal did at MIT, which Hari Balakrishnan pointed me to, and I watched that. So he did that work, and then he's doing more recent work that's actually trying to scope out not only how LLMs reason, but it has some reflections on how humans reason too. And so I just think he's doing some of the more profound work in trying to understand and come up with formal models for how LLMs reason.
C
On that note, you said his most recent work helped you change how humans think. Why don't you flesh that out a little bit? How did it sort of, well, okay.
D
So can I just try to take a rough sketch at it and then you just tell me how wrong I am?
A
Right.
D
And you're trying to describe how LLMs work. And one thing that you found is that they reduce a very, very complex multidimensional space into basically a geometric manifold. That's a reduced state space, so it has reduced degrees of freedom. But you can actually predict roughly where in the manifold the reasoning can move to. So you've reduced the dimensionality of the problem to a geometric manifold, and then you can formally specify how far you can reason within that manifold. And one of the intuitions is that we as humans do the same thing: we take this very complex, heavy-tailed, stochastic universe and we reduce it to this kind of geometric manifold, and then when we reason, we just move along that manifold.
A
Yeah, I think you captured it accurately. That's kind of the spirit of the work. Yeah.
D
Wait, wait, can I just hear it in your words? Because I'm a VC, so.
A
You're a VC with an h-index of what, 60? True. Yeah. So ultimately, what all these LLMs are doing, whether the early LLMs or the LLMs that we have today, with all sorts of post-training, RLHF, whatever you do, at the end of the day, what they do is create a distribution for the next token. So given a prompt, these LLMs create a distribution for the next token or the next word, and then they pick something from that distribution using some kind of algorithm, and then keep going. Now, what happens because of the way we train these LLMs, the architecture of the transformers and the loss function, the way you put it is right: it sort of reduces the world into these Bayesian manifolds. And as long as the LLM is traversing through these manifolds, it is confident and it can produce something which makes sense. The moment it veers away from the manifold, it starts hallucinating and starts spouting nonsense. Confident nonsense, but nonsense.
D
Yeah.
A
So it creates these manifolds, and the trick is that you can measure the entropy of the distribution that is produced. Entropy the way Shannon defined it: Shannon entropy, not thermodynamic entropy. So suppose you have a vocabulary of, let's say, 50,000 different tokens and you have a next-token distribution over these 50,000 tokens. So let's say "the cat sat on the". If that is the prompt, then the distribution will have a high probability for mat or hat or table and a very low probability for, let's say, ship or whale or something like that. So because of the way it's trained, it has these distributions. Now the distributions can be low entropy or high entropy. A high-entropy distribution means that there are many different ways that the LLM can go, with a high enough probability for all those paths. Low entropy means that there is only a small set of choices for the next token. And the prompts, also, you can categorize into two kinds. One kind of prompt has, as you can say, high information entropy, and one has low information entropy. So the way these manifolds work, the LLMs start paying attention to prompts that have high information entropy and low prediction entropy. So what do I mean by that? When I say "I'm going out for dinner," the LLMs have been trained on that phrase, they've seen it a lot, and there are many different directions I can go with it. I can say I'm going for dinner tonight, I'm going for dinner to McDonald's, or I'm going to dinner, blah, blah, blah. There are many different continuations. But when I say "I'm going to dinner with Martin Casado," now this is information rich. This is sort of a rare phrase. And now the realm of possibilities reduces, because Martin is only going to take me to Michelin-star restaurants. I'm not going to go to McDonald's. You get what I'm saying?
The moment you add more context, you make the prompt information rich, the prediction entropy reduces.
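To make the notion of prediction entropy concrete, here is a minimal sketch that computes the Shannon entropy of two next-token distributions. The tokens and probabilities are invented for illustration and do not come from any real model; the function name `shannon_entropy` is likewise just a label for this example.

```python
import math

def shannon_entropy(dist):
    """Shannon entropy, in bits, of a {token: probability} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical next-token distributions (made-up numbers).
# A vague prompt leaves many continuations about equally likely.
vague_prompt = {"tonight": 0.25, "at": 0.25, "to": 0.25, "with": 0.25}
# An information-rich prompt concentrates probability on few continuations.
rich_prompt = {"a": 0.90, "the": 0.05, "some": 0.05}

print("vague prompt entropy (bits):", shannon_entropy(vague_prompt))
print("rich prompt entropy (bits): ", shannon_entropy(rich_prompt))
```

In this toy setup the uniform distribution over four tokens has the larger entropy, and the peaked distribution has the smaller one, which is the sense in which added context "reduces" prediction entropy.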
D
Yep, yep, yep, yep.
A
And another example that I often.
D
But just quickly, what is your takeaway? What is the implication of that? Sorry, I forgot how you described it, but the more precise you are, the more tokens you provide, I presume, the fewer options you have for the next token. Is that correct or not correct?
A
Yeah, yeah, essentially.
D
So you're reducing it to a very specific state space when it comes to confidence in an answer. And this is kind of a manifold that you can move along. And then, do you have a conclusion about what that means for systems or what that means for reasoning, or is it just a nice way to articulate the bounds of LLMs?
A
No, there is something, I don't know if I should say profound, but there is something about it which tells you what these LLMs can or cannot do. So one of the examples that I often give is, suppose I ask you, what is 769 times 1025? You have no idea. You can have some vague idea given the two numbers. And so in your mind, the next-token distribution of the answer is going to be diffuse. You don't know. You have maybe a vague guess. If you are mathematically very good, maybe your guess is more precise, but it's just going to be diffuse and it's not going to be the correct answer. But if you say, can I write it down and do it the way we learned multiplication, now you know exactly what to do at the next step. You write 769 and then 1025, and then you know exactly what to do. So at each stage of that process, your prediction entropy is very low. You know exactly what to do because you have been taught this algorithm. And by invoking this algorithm, saying, okay, I'm not going to just guess the answer, I'm going to do it step by step, your prediction entropy reduces and you can arrive at an answer which you're confident of and which is correct. And the LLMs are pretty much the same way. That's why chain of thought works. What happens with chain of thought is that you ask the LLM to do something and it starts breaking the problem into small steps. These steps it has seen in the past. It has been trained on them, maybe with some different numbers, but the concept it has been trained on. And once it breaks it down, then it's confident: okay, now I need to do A, B, C, D, and then I arrive at this answer.
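The step-by-step procedure in the 769 times 1025 example can be sketched as ordinary long multiplication, where each step is a small, fully determined operation. This is only an illustration of the "break it into known steps" idea, not a claim about what happens inside a model; the function name `long_multiply` is made up for this sketch.

```python
def long_multiply(a, b):
    """Multiply a by b one digit of b at a time, recording each partial product."""
    steps = []
    total = 0
    # Walk the digits of b from least significant to most significant.
    for place, digit_char in enumerate(reversed(str(b))):
        digit = int(digit_char)
        partial = a * digit * 10 ** place
        steps.append((digit, place, partial))
        total += partial
    return total, steps

total, steps = long_multiply(769, 1025)
for digit, place, partial in steps:
    print(f"769 x {digit} x 10^{place} = {partial}")
print("sum of partial products:", total)
```

Each line of output is a step with essentially one right answer, which is the low-entropy situation described above; guessing the final product in one shot is the diffuse, high-entropy alternative.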
C
Let's zoom back out. I want to get into LLMs. But first, Vishal, maybe you can give more context on your background and how that informs your work here.
A
Okay, so, yeah, as Martin said, my background is very similar to his. We come from doing networking. So my PhD thesis, my sort of early work at Columbia has all been in networking. But there's another side of me, another hat that I wear, which is both an entrepreneur and a cricket fan.
D
I was gonna say, don't you own a cricket team or something?
A
I'm a minority owner for your local cricket team, the San Francisco Unicorns.
D
Yeah, that's right. Very proud to have you.
A
So, say, in the 90s, I was one of the people who started this portal called Cricinfo. And Cricinfo, at one point, was the most popular website in the world. It had more hits than Yahoo. That was before India came online. So, you know, cricket is a very stats-rich sport. Think baseball multiplied by 1,000. And we had built this free searchable stats database on cricket called Statsguru. And this has been available on Cricinfo since 2000. But because you can search for anything, everything was made available on Statsguru. And you can't expect people to write SQL queries to query everything. So how did we do it? Well, it was a web form.
D
Where.
A
You could formulate your query using that form. And in the back end, that was translated into a SQL query, which got the results and sent them back. But as a result, because you could do everything and everything was made available, the web form had like 25 different checkboxes, 15 text fields, 18 different dropdowns. The interface was a mess. It was very daunting. And ESPN acquired Cricinfo in the mid-2000s, I think, but they still kept the same interface. And that has always sort of nagged me. And so I still know the people who.
D
Wait, wait, what nagged you is that Cricinfo did not have informal language? It had a web form for doing queries.
A
That web form was terrible. Because of that, only the real nerds used it.
D
Of all the things in the world that bother you, the fact that an old website was a web form. I appreciate your commitment to aesthetic.
A
So I'm still friendly with the people who run ESPNcricinfo, and the editor in chief, whenever he comes to New York, we meet up, we go out for a drink. And so he was here in 2020. So now the story shifts to how LLMs and I sort of met. So January 2020, right before the pandemic, he was here. And I again said, why don't you do something about Statsguru? And he looks at me and says, why don't you do something about Statsguru? He was kind of joking, but he thought maybe, you know, I had some ways to fix the interface. So anyway, then the pandemic hit. The world stopped. But in July of 2020, the first version of GPT-3 was released. And I saw someone use GPT-3 to write a SQL query for their own database using natural language. And I thought, can I use this to fix Statsguru? So I got early access to GPT-3. You know, getting access those days was difficult, but somehow I got it. But soon I realized that, no, I cannot really do it, because the Statsguru backend databases were so complex, and if you remember, GPT-3 had only a 2048-token context window. There was no way in hell I could fit the complexities of that database in that context window. And GPT-3 also did not do instruction following at that time. But then, in trying to solve this problem, I accidentally invented what's now called RAG. I created a database of natural language queries and the corresponding structured queries. I created a DSL, which then translated into a REST call to Statsguru. So, based on the new query, I would look through my set of natural language queries. I had about 1500 examples, and I would pick the six or seven most relevant ones. Those, together with their structured queries, I would send as a prefix along with the new query, and GPT-3 magically completed it. And the accuracy was very high.
So that had been running in production since September 2021, you know, about 15 months before ChatGPT came and the whole revolution in some sense started and RAG became very popular. I didn't call it RAG, but this is something I sort of accidentally did in trying to solve that problem for Cricinfo. Now, once I built it, I was thrilled that this worked, but I had no idea why it worked. I stared at that transformer architecture diagram, I read those papers, but I couldn't understand how or why it worked. So then I started on this journey of developing a mathematical model, trying to understand how it worked. So that's been sort of my journey through this world of AI and LLMs, because I was trying to solve this cricket problem.
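The retrieval step described above (pick the most relevant stored examples, then send them as a prefix ahead of the new query) can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the conversation does not say how relevance was scored, so this sketch uses simple word overlap (Jaccard similarity), and the example queries, the `key=value` DSL strings, and the names `jaccard` and `build_prompt` are all hypothetical.

```python
def tokenize(text):
    """Lowercase whitespace tokenization into a set of words."""
    return set(text.lower().split())

def jaccard(a, b):
    """Word-overlap similarity between two strings (an assumed relevance score)."""
    ta, tb = tokenize(a), tokenize(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def build_prompt(query, examples, k=2):
    """Pick the k stored examples most similar to the query and use them as a prefix."""
    ranked = sorted(examples, key=lambda ex: jaccard(query, ex[0]), reverse=True)
    parts = [f"Q: {nl}\nDSL: {dsl}" for nl, dsl in ranked[:k]]
    parts.append(f"Q: {query}\nDSL:")  # the model is asked to complete this line
    return "\n\n".join(parts)

# Hypothetical (natural language, DSL) pairs; the DSL syntax is invented.
EXAMPLES = [
    ("most runs by a batsman in test matches", "stat=runs;format=test;order=desc"),
    ("most wickets by a bowler in one day internationals", "stat=wickets;format=odi;order=desc"),
    ("highest score at Lord's", "stat=high_score;ground=lords"),
]

prompt = build_prompt("most wickets by a bowler in test matches", EXAMPLES, k=2)
print(prompt)
```

The resulting string would then be sent to a completion model, which is expected to continue after the final `DSL:`. With a small context window, the value of `k` is what keeps the prompt within budget.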
C
Amazing. And so maybe reflecting back since the release of GPT-3, what has most surprised you about how LLMs have developed?
A
So what has most surprised me? The pace of development. So GPT-3 was, you know, a nice parlor trick, and you had to jump through hoops to get it to do something useful. But ChatGPT was an advance over GPT-3. And then you had all these things like chain of thought and instruction following, and GPT-4 really made it polished. And, you know, the pace of development has really surprised me. When I started working with GPT-3, I could sort of see what its limitations were, what I could make it do, what I couldn't make it do. But I never thought of it as what these LLMs have become for me now, and what they have become for millions of people around the world. We treat these models as our coworkers, almost like an intern that you're constantly chatting with, brainstorming, making them do all sorts of work which we couldn't imagine. You know, just when ChatGPT was released, it was nice. It could write poems, it could write limericks, it could answer some questions, sometimes with hallucinations. But the capabilities that have emerged now, that pace has been very surprising to me.
C
Do you see progress plateauing? Either now or in the near future, how do you see it going?
A
Yes, in some sense progress is plateauing. It's like the iPhone. When the iPhone came out, it was, wow, what is this thing? And with the early iterations, we were constantly amazed by new capabilities. But in the last seven, eight, nine years, maybe the camera got a little bit better, or one thing changed here, or there is more memory. But there has been no fundamental advance in what it's capable of. You can see a similar thing happening with these LLMs, and this is not true for just one company and one model. You look at what OpenAI is coming up with, or Anthropic, Google, all these open source Chinese models, or Mistral. The capabilities of LLMs have not fundamentally changed. They've become better, they've improved, but they have not crossed into a different realm.
D
Vishal, this is something that I really appreciate about your work. The thing that really struck me is that as soon as these things showed up, you actually got busy trying to build a formal model of what they're capable of, which was in stark contrast to what everybody else was doing. Everybody else was like, AGI, these things are going to recursively self-improve, or they'd say, oh, these are just stochastic parrots, which doesn't mean anything. So everybody had rhetoric, and sometimes this rhetoric was fanciful and sometimes it was almost reductionist, like, oh, it's just a database, which is clearly not true. And the thing that really struck me about your work is you're like, no, let's figure out exactly what's going on. Let's come up with a formal model. And once we have a formal model, we can reason about what that means. And in my reading of your work, I kind of break it up into two pieces. There's the first one, where you came up with this matrix abstraction, which I think is worth you talking through. And then you took in-context learning as an example and you mapped it to Bayesian reasoning, which to me was incredibly powerful because at the time nobody knew why in-context learning worked. So I think it'd be great for you to discuss that, because I think it was the first real formal effort at explaining how these things work. And then the more recent work that you're working on now is a more generalized version of what is the state space that these models output when it comes to confidence, which is the manifold that we were talking about before. So I think it would be great if you just described your matrix model and then how you use that to provide some bounds on what in-context learning is doing.
A
Okay, so yeah, let's start with that matrix abstraction. So the idea behind the matrix is you have this gigantic matrix where every row corresponds to a prompt. And then the number of columns of this matrix is the vocabulary of the LLM, the number of tokens it has that it can emit. So for every prompt, this matrix contains the distribution over this vocabulary.
D
Yep.
A
So when you say "the cat sat on the", the column that corresponds to "mat" will have a high probability. Most of them will be zero, but reasonable continuations will have a nonzero probability. And so you can imagine that there's this gigantic matrix. Now, if we take just the old first-generation GPT-3 model, which had a context window of 2,000 tokens and a vocabulary of 50,000 tokens, then the number of rows in this matrix is more than the number of atoms across all the galaxies that we know of. So clearly we cannot represent it exactly. Now, fortunately, a lot of these rows do not appear in real life. An arbitrary collection of tokens, you are not going to use that as a prompt. So a lot of these rows are absent, and a lot of the column values are also zero. When you say "the cat sat on the", it's unlikely to be followed by the token corresponding to, let's say, numbers or an arbitrary collection of tokens. There will be only a very small subset of tokens that can follow a particular prompt. So this matrix is very, very sparse. But even after that sparsity, and even after removing the gibberish prompts, the size of this matrix is too much for these models to represent, even with a trillion parameters. So in an abstract sense, what is happening is that the models get trained on certain data from the training set, and for a small subset of these rows you have reasonable values for the next-token distribution. Whenever you give it a new prompt, it'll try to interpolate between what it has learned and what's there in the new prompt and come up with a new distribution. So it's more than a stochastic parrot. It is sort of Bayesian on this subset of the matrix that it has been trained on. So when I say, you know, I'm going out for dinner with Martin tonight.
Now, I'm reasonably sure that it has never encountered that phrase in its training data. Right? But it has encountered variants of this phrase. And given that I'm going out with Martin, it can produce a Bayesian posterior. It uses that evidence, that Martin is the one that I'm going for dinner with, and it'll produce a next-token distribution that will focus on the likely places that we are going. So this matrix is represented in a compressed way, yet the models respond to every prompt. How do they do it? Well, they go back to what they've been trained on, interpolate there, and use the prompt as sort of some evidence to compute a new distribution, which.
D
The context of the prompt impacts the posterior distribution.
A
Exactly. Yeah, right.
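A quick back-of-the-envelope check of the matrix size claim, plus a toy sparse row. The sketch assumes a 2048-token context window (the figure quoted earlier for GPT-3) and a 50,000-token vocabulary, counts only prompts of exactly full length, and compares against roughly 10^80 atoms, a commonly quoted order-of-magnitude estimate for the observable universe that is an outside assumption here. The row contents are invented.

```python
import math

VOCAB_SIZE = 50_000
CONTEXT_LEN = 2048
# Assumption: ~10^80 atoms in the observable universe (rough, commonly quoted).
ATOMS_LOG10 = 80

# Number of distinct full-length prompts is VOCAB_SIZE ** CONTEXT_LEN.
# Work in log10 to avoid printing a number with thousands of digits.
rows_log10 = CONTEXT_LEN * math.log10(VOCAB_SIZE)
print(f"rows ~ 10^{rows_log10:.0f}, atoms ~ 10^{ATOMS_LOG10}")

# A sparse view of the matrix: store only prompts that occur, and for each
# prompt only the tokens with nonzero probability (made-up numbers).
sparse_matrix = {
    "the cat sat on the": {"mat": 0.6, "hat": 0.2, "table": 0.2},
}

def next_token_prob(prompt, token):
    """Look up one matrix entry; anything not stored is treated as zero."""
    return sparse_matrix.get(prompt, {}).get(token, 0.0)

print(next_token_prob("the cat sat on the", "mat"))
print(next_token_prob("the cat sat on the", "whale"))
```

Even this sparse dictionary view cannot scale to every sensible prompt, which is why the discussion turns to interpolation from a trained subset of rows rather than exact storage.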
D
And you mapped to Bayesian learning, where the context is the new evidence.
A
New evidence, exactly. To learn from. So, for instance, take the cricket example that I spoke about earlier. I created my own DSL, and I mapped a natural language query in cricket to this DSL, which I can then translate into a SQL query or a REST API call, whatever. But getting the DSL is important. Now, these LLMs have never seen that DSL. I designed it. Right? But yet, after showing a few examples, it learned it. How did it learn it?
D
And this is in the prompt. You did no training; it's 100% in the prompt. Right? So, like, the weights are.
A
Yeah, yeah, yeah, yeah. This was happening in October 2020. I had no access to the internals of OpenAI. I could just, you know, access their API. OpenAI had no access to the internal structure of Statsguru or the DSL that I cooked up in my head. Yet after showing it only a few examples, it learned it right away. So that's an example where it has seen DSLs or structures in the past. And now, using this evidence that I show, okay, this is what my DSL looks like, given a new natural language query it is able to create the right posterior distribution for the tokens that map to the examples that I've shown. Now, the other beautiful thing about this is that this is an example of few-shot learning, or in-context learning, right? But when I give that prompt along with these examples to this LLM, I'm not saying to the LLM, okay, this is an example of few-shot learning, so learn from these examples. You just pass this to the LLM as a prompt, and it processes it exactly the way it would process any other prompt which is not an example of in-context learning. So that really means that the underlying mechanism is the same whether you give a set of examples and then ask it to complete a task, as in in-context learning, or just give it some prompt for continuation, like "I'm going out for dinner with Martin tonight." There's no in-context learning there. But the process with which it's generating or doing this inferencing is exactly the same. And that's what I have been trying to model and come up with a formal model of.
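The idea that the prompt acts as evidence that reshapes a distribution can be illustrated with a plain Bayes' rule update on the dinner example. This is a toy illustration of the Bayesian reading only, not a description of what a transformer computes internally, and the hypotheses, prior, and likelihood values are all invented.

```python
def posterior(prior, likelihood):
    """Bayes' rule over a finite set of hypotheses: posterior ∝ prior × likelihood."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnormalized.values())
    return {h: v / z for h, v in unnormalized.items()}

# Hypothetical prior over where "I'm going out for dinner" leads.
prior = {"fast food": 0.5, "casual": 0.4, "fine dining": 0.1}

# Hypothetical likelihood of the evidence "with Martin" under each hypothesis.
likelihood_with_martin = {"fast food": 0.01, "casual": 0.2, "fine dining": 0.9}

post = posterior(prior, likelihood_with_martin)
for venue, p in sorted(post.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{venue}: {p:.3f}")
```

With these made-up numbers the extra context moves most of the probability mass away from fast food and toward fine dining, even though fine dining started with the smallest prior. The same update rule applies whether the evidence is a single phrase or a block of few-shot examples.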
D
What I've found very impressive is that you've used this basic model to show a number of things, right? To describe in-context learning and to map it to Bayesian learning. But you did it for another one, where you sketched out this almost glib argument on X, a rough argument for why recursive self-improvement can't happen without additional information. And so maybe just walk through very quickly how, with the same model, you can show that a model can never recursively self-improve.
A
So another phrase that we have been using recently is the output of the LLM is the inductive closure of what it has been trained on. So when you say that it can recursively self improve, it could mean one of two things. So let's get back to the.
D
Well, actually, you know what's kind of interesting is that most people agree that if you have one LLM and you just feed the output back into the input, it's not going to do anything. But then often people will say, well, what if you have two LLMs? You have no external information, but you have two LLMs talking to each other. Maybe they can improve each other, and then you can have, you know, a takeoff scenario. But again, you address this even in the case of n LLMs, using the matrix model to show that you just aren't gaining any information entropy.
A
Yeah. So you can represent the information contained in these models. Let's go back to that matrix abstraction that I have. So like I said, these models represent a subset of the rows, but some of these rows are able to help fill out some of the missing rows. For instance, if the model knows how to do multiplication step by step, then for every row that corresponds to, let's say, 769 times 1025 or whatever, it can fill out the answer, because it has those algorithms embedded in it. You just need to unroll them.
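One way to picture "filling out missing rows by unrolling known algorithms" is as a closure computation: start from known facts, apply known rules until nothing new appears, and observe that anything not reachable from the starting set never shows up. The sketch below is a generic fixed-point closure used as an analogy; it is not the formal definition of "inductive closure" used in the speakers' papers, and the facts and rules are placeholders.

```python
def inductive_closure(facts, rules):
    """Repeatedly apply (premises -> conclusion) rules until no new fact is added."""
    closed = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # A rule fires only if all of its premises are already derived.
            if premises <= closed and conclusion not in closed:
                closed.add(conclusion)
                changed = True
    return closed

# Placeholder facts and rules (purely illustrative).
facts = {"a", "b"}
rules = [
    (frozenset({"a", "b"}), "c"),   # derivable from the starting facts
    (frozenset({"c"}), "d"),        # derivable in a second step
    (frozenset({"x"}), "y"),        # never fires: "x" is not derivable
]

closure = inductive_closure(facts, rules)
print(sorted(closure))
```

Running the closure again on its own output adds nothing further, which mirrors the informal argument that feeding a model's output back to itself, with no outside information, cannot take it beyond what its training already implies.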
D
Yeah.
A
So it can sort of self improve up to a point. But beyond a point, these models can only sort of generate what they have been trained on. So let me give you, I'll give you three examples.
D
Yeah.
A
So any model, any LLM that was trained on pre-1915 physics would never have come up with a theory of relativity. Einstein had to sort of reject Newtonian physics and come up with this spacetime continuum. He completely rewrote the rules. So that is an example of AGI, where you are creating or generating new knowledge. It's not simply unfurling the universe, it's.
D
Not computing, it's actually discovering something fundamental about the universe.
A
Fundamental. And for that you have to go outside your training set. Similarly, you know, any LLM that was trained on classical physics would not have come up with quantum mechanics. Right. That's wave-particle duality, or this whole probabilistic notion, or that, you know, energy is not continuous but quantized. You had to reject Newtonian physics. Or Gödel's incompleteness theorem. He had to go outside the axioms to say that, okay, it is incomplete. So those are examples where you're creating new science or fundamentally new results. That kind of self-improvement is not possible with these architectures. They can refine, they can fill out these rows where the answer already exists. Another example which has received a lot of press these days is these IMO results, the International Math Olympiad. Whether it's a human solving it or the LLM solving it, they are not inventing new kinds of math. They are able to connect known results in a sequence of steps to come up with the answer. So even the LLMs, what they are doing is exploring all sorts of solutions. In some of these solutions they start going down a path where their next-token entropy is low. That's where I say they are in that Bayesian manifold, where you have this entropy collapse. And by doing those steps you arrive at the answer. But you're not inventing new math, you're not inventing new axioms or new branches of mathematics. You're sort of using what you've been trained on to arrive at that answer. So those things LLMs can do, and they'll get better at connecting the known dots. But creating new dots, I think we need an architectural advance.
C
So Martin was talking earlier about how the discourse was either stochastic parrots or, you know, AGI and recursive self-improvement. How do you conceive of the AGI discourse, or even the concept? What does it mean, to the extent that it's useful? How do you think about that?
A
The way I think about it, the way we have tried to formulate it in our papers, is that it's beyond a stochastic parrot, but it's not AGI. It's doing Bayesian reasoning over what it has been trained on. So it's a lot more sophisticated than just a stochastic parrot.
C
How do you define AGI?
A
Okay, so AGI. So how do I define AGI? The way I would say it is that LLMs currently navigate through this known Bayesian manifold; AGI will create new manifolds. So right now these models navigate, they do not create. AGI will be when we are able to create new science, new results, new math. When an AGI comes up with a theory of relativity, I mean, it's an extremely high bar, but you get what I'm saying, it has to go beyond what it has been trained on to come up with new paradigms, new science. And that's my definition of AGI.
D
Vishal, based on the work you've done, can you bound the amount of data or compute that would be needed in order for it to evolve? So one of the problems, if you just take LLMs as they exist, is that there was so much data used to create them. To create a new manifold, we'd need a lot more data, just because of the basic mechanisms. Right? Otherwise it'll just kind of get consumed into the existing set of data. Have you found any bounds on what would be needed to actually evolve the manifold in a useful way? Or do you think we just need a new architecture?
A
I personally think that we need a new architecture. The more data that we have, the more compute we have, we'll get maybe smoother manifolds. So it's like a map.
D
Yeah, because there's this view that people have. They're like, well, Vishal, this is all good and well, but I could just take an LLM and give it eyes and give it ears and put it in the world, and it'll gain information, and based on that interface it'll improve itself and therefore it can learn new things. But the counterpoint that I've always intuitively had to that is that the amount of data used to train these things is so large. How much can you actually evolve that manifold given an incremental amount of data? I mean, almost not at all. Right. There has to be some other way to generate new manifolds that isn't evolving the existing one.
A
I completely agree. There has to be a new sort of architectural leap to go beyond the current approach. Just throwing more data and more compute at it, you know, it's going to plateau. It's, you know, the iPhone 15, 16, 17.
C
And are there any research directions that are promising in your mind that might help us, you know, go beyond LLM limitations?
A
So, I mean, again, I love LLMs. They are fantastic and they are going to increase productivity like nobody's business. But I don't think they are the answer. So, you know, Yann LeCun famously says that LLMs are a distraction on the road to AGI.
D
They're a dead end. They're a dead end.
A
I'm not quite in that camp, but I think we need a new architecture to sit on top of LLMs to reach AGI. You know, a very basic thing, what Martin just said: you give them eyes and you give them ears, you make them multimodal. Of course they'll become more powerful, but you need a little bit more than that. The way human brains learn, with very few examples, that's not the way transformers learn. And I'm not saying that we need to create an Einstein or a Gödel, but there has to be an architectural leap that is able to create these manifolds. And just throwing new data at it will not do it. It'll just smooth out the already existing manifolds.
D
Is that something? So is your goal to actually help like think through new architectures or are you primarily focused on putting formal bounds on existing architectures?
A
A bit of both. I mean, the former goal is the more ambitious one that everybody is chasing. And yeah, I think about that constantly.
C
Are there any new, even like sort of hints at a new architecture or like, have we started to make any progress on new architectures or is it.
A
You know, Yann has been pushing this JEPA architecture. Energy-based architectures, they seem promising. The way I have been sort of thinking about it is, you know, there's this set of benchmarks, the ARC Prize. Yeah, right. That Mike Knoop and François Chollet have. And if you understand why the LLMs are failing on this test, maybe you can sort of reverse engineer a new architecture that will help you succeed in that. Right. And I agree with a lot of what several people say that, you know, language is great, but language is not the answer. When I'm looking at catching a ball that is coming to me, I'm mentally doing that simulation in my head. I'm not translating it to language to figure out where it'll land. I do that simulation in my head.
D
So.
A
One of the new architectural things is how do we get these models to do approximate simulations to test out that idea and whether to proceed or not? So, yeah, another thing that I've always wondered about is, as humans, did we develop language because we were intelligent, or because we developed language, we accelerated our intelligence? So I don't know which side of the camp you're on with that.
D
What's interesting is like you have these anecdotal examples of humans developing languages de novo that have been recorded. Right. Like it's either the Guatemalan or Nicaraguan sign language, right, where there are these students that developed their own language without being taught. And so that would suggest that language follows intelligence. The problem is they're all anecdotal. Right. Like, who knows if somebody didn't teach them sign language? Like, nobody really knows. There are no controls. So this is all observational studies. And there's so few of them, you have to wonder if it's just kind of sloppy observation. And so I think that the question is still outstanding.
A
Yeah. So I mean, language definitely accelerated our intelligence, there's no question about that. But which followed which, we don't know.
D
I view it as a networking problem, naturally, which is once you have languages, you can communicate and when you communicate.
A
You can store, you can replicate. Yeah, yeah, yeah, exactly, exactly.
D
Right, cool. Again, this is kind of a wonky question, but I think one thing that you've brought to the discourse and for those that are listening to this, I really think that you should look up Vishal's work and read it. I just think it'll give you a really, really. Especially if you have a systems background, like a networking systems background, give you a really, really good understanding of kind of the bounds on these. But like the toolkit that you draw from is like information theory and like more formal. Have you found that the AI community is receptive to this? Or is it like two different cultures, two different planets trying to communicate and not a lot of common ground. Like how have you found like bringing like the networking view of the world to the AI realm?
A
Some of them are receptive to it, definitely. But you know, these large conferences and their reviewing processes, it's so random. And the kind of questions they ask. You know, I'm a modeling person, I like to model things. And you know, I submitted one version of this work to one very famous machine learning or AI conference and the reviewer said, okay, this is a model. So what? So that is.
D
That's absolutely remarkable. So like you've actually taken a system that nobody understands, that we have no models for. You actually provided some model that we can use to analyze it and that alone wasn't sufficient.
A
They're asking, so where are the large scale experiments to prove this?
D
I do listen, I honestly, I mean, I find there's so much empiricism in the current, you know, AI community. Exactly. Because we don't understand the systems. You know, it kind of reminds me, I, I feel like, I feel like systems went the other way. Right. It's like we had all of these models, but then we didn't understand how the systems worked and then we just like actually did measurement. It feels like ML and or the AI stuff is the opposite, which is like we know we don't understand them and so we just measured them, but now we're trying to like come up with the models.
A
Yeah, exactly. So it was so easy in some sense to build these artifacts and then just measure them that people have been going around trying to do that. And one term I really dislike is prompt engineering. Why? Engineering used to mean sending a man to the moon or providing five nines reliability. Prompt engineering is prompt twiddling. You fiddle with a prompt and the model changes and the inference, the output changes, and you have like hundreds of papers just doing one XY on the other, changing a prompt this way, that way, and writing their observations. And as a result, lots of these papers are being written, are being submitted for review. Reviewers get busy looking at all this kind of empirical work. And my personal taste is to first try to understand. Model it.
D
Yeah.
A
And then you can do the other.
D
Sounds like a true theory guy. I don't know about this bit twiddling.
C
Let me ask one more LLM question which is are there any benchmarks or real world tasks that if they, if they occurred, you'd sort of reevaluate and say, hey, maybe LLMs are, you know, closer to the path to AGI than I thought.
A
If there were any real-world tasks. Good question. For LLMs or these models, the one domain where you have the most training data is probably coding. And coding is where you can also have the most structure. And yet anyone who has used these tools, whether it's Cursor or whatever, Claude Code, LLMs continue to hallucinate, continue to generate unreasonable code. You know, you have to constantly babysit these models. So the day an LLM can create a large software project without any babysitting is the day I'll be a little bit convinced that it's towards AGI. But again, I don't think it'll be able to create new science. If it does, that's when I'll be convinced.
D
I think that you can almost take a definitional approach to answer this question. Vishal, the problem with these types of questions is if you have billions of dollars and you can collect whatever data you want, you can make a model do anything you want. Right? And so you know what I'm saying, at some level you've got this entire capital structure machinery behind these models. So you're like, oh, it can be good at science. Well, sure, you put a billion dollars at solving materials science and collect all this data, you'll be good at materials science or whatever it is. But there is a definitional answer which is, and I'm going to draw from your work, which is there is a manifold that's in there based on the data it's been trained on. And then the question is if it ever produces something that's off that, like a new manifold. So considering the existing training data, if it ever does that, if it does something that's outside of that distribution, then clearly we're on a path to learning new things. And if not, then everything is just a computational step from what's already known.
A
Yeah, that's all.
D
I mean, and then I guess the counter to that would be maybe all humans do is work on their own manifold. And Einstein, you know, was lucky or something I guess would be the counter to that.
A
But you know, there's several mini Einstein examples and yeah, it's creating this new manifold. I didn't want to use that definitional answer. I thought it might sound too, yeah, too wonky, too mathematical. But essentially if LLMs really created this new manifold, then I would be convinced. But so far they have just gotten better at navigating the existing manifold, the existing training set, which is hugely powerful.
D
And it's going to change the world.
A
Which is hugely powerful. I'm not denying that. I think they're extremely, extremely good at what they can do, but there's a limit to what they can do.
D
So I have one quick question. What's next for you? I mean, you've tackled in context learning, you've got a model for LLMs and now you've got a general generalized model for like, you know, like their solution space. What are you thinking about tackling next in terms of modeling or academically, an LLM?
A
Academically. Academically. I'm, you know, I'm thinking of this. What is the architectural leap that is needed?
D
Oh, that's exciting.
A
To create this, you know, manifold and how do we use, you know, multimodal data?
D
Awesome.
A
To. To expand your.
C
Come back and talk to us.
D
That's right. We love that.
A
So, I mean, you know, even with LLMs, you know, in the paper we say that you can improve the inference by following this low or minimum entropy path. So that's a very sort of small step that we are taking. We are building and training models that will do inference based on the entropic path. Yeah.
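To make the "low entropy path" idea concrete, here is a minimal sketch of one possible reading of it, not the method from the paper. It assumes a hypothetical toy model (`TOY_MODEL`, invented for illustration) that maps a context to a next-token distribution, and it picks the candidate token whose follow-up distribution is least uncertain. The one-step lookahead rule and all names are assumptions.

```python
import math

def entropy_bits(dist):
    """Shannon entropy, in bits, of a {token: probability} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical toy "model": context (tuple of tokens) -> next-token distribution.
TOY_MODEL = {
    (): {"the": 0.5, "a": 0.5},
    ("the",): {"cat": 0.9, "dog": 0.1},
    ("a",): {"cat": 0.4, "dog": 0.3, "bird": 0.3},
}

def lookahead_entropy(context, token):
    """Entropy of the distribution that follows context + token.

    Contexts the toy model does not know are treated as maximally
    uncertain (infinite entropy) so they are never preferred.
    """
    following = TOY_MODEL.get(context + (token,))
    return entropy_bits(following) if following is not None else float("inf")

def pick_low_entropy_token(context):
    """Among the candidate next tokens, choose the one whose follow-up is most confident."""
    candidates = TOY_MODEL[context]
    return min(candidates, key=lambda tok: lookahead_entropy(context, tok))

chosen = pick_low_entropy_token(())
print(chosen, round(lookahead_entropy((), chosen), 3))
```

Here both first tokens are equally likely, so ordinary greedy decoding has no preference, while the entropy criterion prefers the branch where the model is more certain about what comes next.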
D
By the way, is Model Probe still up?
A
Token Probe? Yeah. Yeah, Token Probe is still up. And you can see actually the, you know, Token Probe is a software that we built and thanks to Martin and A16Z's generosity, it's running on your servers and anyone can go and test. And what we have done there is we actually show the entropy.
D
Yeah. It is so enlightening. I recommend anybody listening to this who's interested actually check out Token Probe. It literally shows you the confidence. Yeah. As you go along. It's remarkable.
A
So in-context learning, you create your new DSL and you give it to the prompt and you can see the confidence rising with each new example, the entropy reducing. And that sort of is a validation of the model. You can see it sort of unfurling right in front of your eyes, the Token Probe in scale running. Thanks. Thanks again.
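The effect described here, confidence rising and entropy falling as examples accumulate, can be illustrated with a small Bayesian update. This is a toy sketch under stated assumptions, not Token Probe or the model from the paper: the hypothetical DSL operator `@`, the three candidate meanings, and the likelihood values 0.9 and 0.05 are all invented for illustration.

```python
import math

def entropy_bits(dist):
    """Shannon entropy, in bits, of a {hypothesis: probability} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical candidate meanings for a made-up DSL operator "@".
HYPOTHESES = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
    "subtract": lambda a, b: a - b,
}

def bayes_update(posterior, example, p_match=0.9, p_mismatch=0.05):
    """Reweight each hypothesis by how well it explains one (a, b, result) example."""
    a, b, result = example
    weighted = {
        name: prob * (p_match if HYPOTHESES[name](a, b) == result else p_mismatch)
        for name, prob in posterior.items()
    }
    total = sum(weighted.values())
    return {name: w / total for name, w in weighted.items()}

# Start from a uniform prior, then show the "prompt" examples one at a time.
posterior = {name: 1 / len(HYPOTHESES) for name in HYPOTHESES}
entropies = [entropy_bits(posterior)]
for example in [(2, 2, 4), (2, 3, 5)]:  # "2 @ 2 = 4", then "2 @ 3 = 5"
    posterior = bayes_update(posterior, example)
    entropies.append(entropy_bits(posterior))

print([round(h, 3) for h in entropies])
print(max(posterior, key=posterior.get))
```

The first example is consistent with both addition and multiplication, so uncertainty only drops partway; the second rules multiplication out and the entropy falls again. In this toy case the entropy decreases at every step, although a single surprising example can raise it in general.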
C
Vishal. Thanks so much for coming on the podcast. It's a great conversation.
A
It was great fun. Thank you. Thank you so much. Again.
B
Thanks for listening to this episode of the A16Z podcast. If you liked this episode, be sure to like, comment, subscribe, leave us a rating or review and share it with your friends and family. For more episodes go to YouTube, Apple Podcasts and Spotify. Follow us on X @a16z and subscribe to our Substack at a16z.substack.com. Thanks again for listening and I'll see you in the next episode. As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any A16Z fund. Please note that A16Z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com/disclosures.
Date: October 13, 2025
Host: Andreessen Horowitz
Guests: Prof. Vishal Misra (Columbia University), Martin Casado (a16z General Partner)
This episode dives deep into the capabilities and limitations of Large Language Models (LLMs) with Columbia Professor Vishal Misra, whose formal models provide a rigorous way to understand how LLMs reason and why, despite impressive advances, today’s models are fundamentally constrained. The conversation traverses the history of retrieval-augmented generation (RAG), the mathematical structure of LLM outputs, and the crucial distinction between synthesizing existing knowledge and true scientific discovery (AGI). The episode features rich technical discussion, real-world analogies, and arguments about the future, and the limits, of the current AI paradigm.
| Timestamp | Segment / Quote |
|-------------|------------------------------------------------------------------------|
| 00:00 | Definition of AGI: “create new science, new results, new math” |
| 04:18–04:43 | LLM confidence, manifolds, and hallucinations |
| 13:08 | Misra’s accidental discovery of RAG via Cricinfo |
| 21:17 | Matrix abstraction of LLM reasoning |
| 26:15 | Bayesian mechanism underlying in-context learning |
| 29:19–31:00 | Mathematical proof that self-improvement is not possible without new data |
| 33:46 | Misra’s definition of AGI |
| 36:46 | Limitation of LLMs and the need for architectural leaps |
| 38:10–39:17 | Need for simulation and other modalities |
| 44:31 | Coding as a true AGI benchmark |
| 47:07 | Creation of new manifolds as a marker of true intelligence |
Summary compiled in the spirit of the technical depth and candid humbleness of the conversation. For listeners interested in the mathematical underpinnings, formal boundaries, and future prospects of AI and AGI, this episode is a must-listen.