
Vishal Misra returns to explain his latest research on how LLMs actually work under the hood. He walks through experiments showing that transformers update their predictions in a precise, mathematically predictable way as they process new information, explains why this still doesn't mean they're conscious, and describes what's actually required for AGI: the ability to keep learning after training and the move from pattern matching to understanding cause and effect.
Vishal Misra
Anthropic makes great products. Claude Code is fantastic, Cowork is fantastic. But they are grains of silicon doing matrix multiplication. They don't have consciousness. They don't have an inner monologue. You take an LLM and train it on pre-1916 or pre-1911 physics and see if it can come up with the theory of relativity. If it does, then we have AGI.
Martin Casado
Just today, by the way, Dario allegedly said that you can't rule out that they're conscious.
Vishal Misra
You can rule out that they're conscious. Come on. To get to what is called AGI, I think there are two things that need to happen.
Podcast Host
Five years ago, Vishal Misra got GPT-3 to translate natural language into a domain-specific language it had never seen before. It worked. He had no idea why. So he set out to build a mathematical model of how LLMs actually function. The result? A series of papers showing that transformers update their predictions in a precise, mathematically predictable way. In controlled experiments, the models match the theoretically correct answer almost perfectly. But pattern matching is not intelligence. LLMs learn correlation; they don't build models of cause and effect. To get to AGI, Misra argues, we need the ability to keep learning after training and the move from correlation to causation. Martin Casado speaks with Vishal Misra, professor and Vice Dean of Computing and AI at Columbia University.
Martin Casado
Vishal, it's great to have you in.
Vishal Misra
Great to be back.
Martin Casado
This is one of my favorite topics, which is how do LLMs actually work? And I think that in my opinion, you've done kind of the best work on this, modeling it out.
Vishal Misra
Thank you.
Martin Casado
For those that did not see the original one, maybe it's worth doing just a quick background on what led you to this point, and then we'll just go into the current work that you've been doing.
Vishal Misra
Five years ago, when GPT-3 was first released, I got early access to it and I started playing with it. I was trying to solve a problem related to querying a cricket database, and I got GPT-3 to do in-context learning, few-shot learning. It was, at least to me, the first known implementation of RAG, retrieval-augmented generation, which I used to solve this problem of getting GPT-3 to translate natural language into something that could be used to query a database that GPT-3 had no idea about. I had no access to GPT-3's internals, but I was still able to use it to solve that problem. So it worked beautifully. We deployed this in production at ESPN in September 2021.
Martin Casado
But you did the first implementation of RAG in 2021.
Vishal Misra
No, no, no, in 2020.
Martin Casado
20.
Vishal Misra
2020. I got it working. And by the time you talked to all the lawyers at ESPN and productionized it, it took a while, but by October 2020 we had, well, I had this architecture working. But after I got it to work, I was amazed that it worked. I wanted to understand how it worked. And I looked at the Attention Is All You Need paper and all the other sort of deep learning architecture papers, and I couldn't understand why it worked. So then I started getting sort of deep into building a mathematical model.
Martin Casado
Yeah. And now you've published a series of papers. The first one that I read was the one where you had kind of your matrix abstraction. So maybe we'll talk about that and then we'll talk about the more recent work. So perhaps we'll just start with the first one, which is you're trying to come up with a mathematical model of how LLMs work.
Vishal Misra
Yeah.
Martin Casado
And you have, which was very helpful to me. And at the time, you were actually trying to figure out how in-context learning was working.
Vishal Misra
Yes.
Martin Casado
Yeah. And you came up with an abstraction for LLMs, which is basically this very large matrix, and you use that to describe. So maybe you can kind of walk through that work very quick.
Vishal Misra
Sure, yeah. So what you do is you imagine this huge gigantic matrix where every row of the matrix corresponds to a prompt. And the way these LLMs work is given a prompt, they construct a distribution of probabilities of the next token. Next token is next word. So every LLM has a vocabulary, GPT and its variants have a vocabulary of about 50,000 tokens. So given a prompt, it'll come up with a distribution of what the next token should be. And then all these models sample from that distribution.
Martin Casado
So that's the posterior distribution.
Vishal Misra
That's the posterior distribution. Right. That's how LLMs work. And so the idea of this matrix is for every possible combination of tokens, which is a prompt, there's a row and the columns are a distribution over the vocabulary. So if you have a vocabulary of 50,000 possible tokens, it's a distribution over those 50,000 tokens.
Martin Casado
And by distribution, it's just the probability.
Vishal Misra
Just the probability. Sorry. Yeah, just the probability that the next token should be this versus that. So that's sort of the idea. And when you start viewing it that way, it makes things clearer, at least to people like me who want to model what's happening. So concretely, let's say your prompt is just one word: protein.
Martin Casado
Yeah.
Vishal Misra
So if you look at the distribution of the next word, the next token after that, most of the probabilities would be zero, but you'd have non-zero, non-trivial probabilities on, let's say, two words. One is synthesis, the other is shake. Right. And now the LLM is going to sample this next token and maybe pick synthesis or shake, or you as a human will give the prompt protein shake or protein synthesis. Now, depending on whether you pick synthesis or shake, that row looks very different. Right. If you pick protein synthesis, the terms that would have a high probability would all be concerned with biology. But if you pick protein shake, it'll all be about gyms and exercise and bodybuilding stuff. So that synthesis or shake completely changes what comes next. Yeah, so this is an example of, you can say, Bayesian updating. You start with protein, you have a prior that after protein this is going to happen. As soon as you get new evidence, that the next term is synthesis or shake, you completely update the distribution. So now you can imagine that the entirety of LLMs is this giant matrix where you have every row: protein shake, protein synthesis, the cat sat on the, Humpty Dumpty, blah, blah, blah. Now, given the vocabulary of these LLMs, let's say 50,000, and the context window. So ChatGPT, for instance, the first version had a context window of 8,000 tokens. If you look at all possible combinations of 8,000 tokens and a 50,000-token vocabulary, the number of rows in this matrix is more than the number of electrons across all galaxies. Right. So there's no way that these LLMs can represent it exactly. Now, fortunately, this matrix is very sparse. Why? Because an arbitrary combination of these tokens is gibberish. We are never going to use that in real life. Also, the columns are mainly zero. Right. If you have protein, then you won't have arbitrary numbers or arbitrary words after that. 
It's very sparse, both in rows and in columns. So in kind of an abstract way, what all these LLMs are doing is coming up with a compressed representation of this matrix. And when you give a prompt, they try to approximate what the true distribution should have been and try to generate it. That's what, in my mind, at least,
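The matrix picture above can be sketched as a toy program. Everything here is invented for illustration: a tiny three-row slice of the conceptually astronomical matrix, where each row stores only its non-zero next-token probabilities.

```python
import random

# Hypothetical toy slice of the giant prompt-to-distribution matrix described
# above. Each key is a row (a prompt); each row stores only its non-zero
# next-token probabilities -- the real matrix is astronomically large and sparse.
MATRIX = {
    "protein":           {"synthesis": 0.5, "shake": 0.5},
    "protein synthesis": {"occurs": 0.6, "requires": 0.4},
    "protein shake":     {"recipe": 0.7, "before": 0.3},
}

def next_token(prompt):
    """Sample the next token from the row's distribution (the posterior)."""
    row = MATRIX[prompt]
    tokens = list(row)
    weights = list(row.values())
    return random.choices(tokens, weights=weights)[0]

# Sampling 'synthesis' vs 'shake' moves us to a very different row --
# the updating described above.
prompt = "protein"
prompt = prompt + " " + next_token(prompt)
print(prompt)
```

Whichever token gets sampled, the extended prompt indexes a completely different row, which is why one word can swing the whole continuation from biology to bodybuilding.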
Martin Casado
it boils down to. And just for my understanding: if you have a row for protein and then you have one for protein shake, is protein shake a subset of protein, or is it different?
Vishal Misra
It's different, it's a continuation.
Martin Casado
I see, yeah, right. No, but I'm just saying like the actual posterior distribution is that a subset?
Vishal Misra
You can say it's a subset. Right. If you have protein, then protein shake and protein synthesis are both continuations from protein. So both synthesis and shake have non-zero probabilities. So yeah, you can think of it as somewhat of a subset. Right.
Martin Casado
You use this approach to describe how in context learning works. And so maybe first describe what in context learning is and then kind of the conclusion that you came from that.
Vishal Misra
So in-context learning is when you show the LLM something it has kind of never seen before. You give it a few examples of: this is what you want, this is what you're trying to do. Then you give it a new problem which is related to the examples that you've shown, and the LLM learns in real time what it's supposed to do and solves the problem.
Martin Casado
By the way, the first time I saw this, it absolutely blew my mind. I actually used your DSL when I was first learning about it. So maybe describe the DSL thing, and how you even saw that this works at all.
Vishal Misra
It's absolutely mind-blowing that it works. And so going back to that cricket problem: in the mid-90s, I was part of a group that had created this cricket portal called Cricinfo. Cricket is a very stat-rich sport. Think baseball multiplied by a thousand; it has all kinds of stats. And we had created this online searchable database called StatsGuru, where you could search for anything, any stat related to cricket. It's been available since 2000. But because you can query for anything, everything was made available. And how do you make something like that available to the general public? Well, they're not going to write SQL queries. The next best thing at that time was to create a web form. Unfortunately, everything was crammed into that web form. So as a result, you had like 20 dropdowns, 15 checkboxes, 18 different text fields. It looked like a very complicated, daunting interface. So even though it could answer any query, almost no one used it. A vanishingly small percentage of cricket fans used it, because it just looked intimidating. Then ESPN bought the site in 2007. I still know the people who run the site, and I always told them, you know, why don't you do something about StatsGuru? And in January 2020, the editor-in-chief of Cricinfo, Sambit Bal, he's a friend, came to New York and we had gone out for drinks, and again I told him, why don't you do something about StatsGuru? So he looks at me and says, why don't you do something about StatsGuru? He was joking, but that idea kind of stayed with me. And when GPT-3 was released, I thought maybe I could use GPT-3 to create a front end for StatsGuru. So what I did was I designed a DSL, a domain-specific language, and converted queries about cricket stats in natural language into this DSL.
Martin Casado
Now, and to be clear, you created this. It wasn't part of any training, nothing online that it could
Vishal Misra
have seen. Nothing GPT could have seen. I created it. I thought, okay, this makes sense. So I designed that DSL and then I did that few-shot learning thing. I created a database of, I would say, about 1,500 natural language queries and the DSL corresponding to each query. So when a new query came in, somebody asking a stats question in English, what I would do is go through the natural language queries, do a semantic search, pick the most closely matching top few, and then use each of those natural language queries and its DSL and send that as a prefix. Now, GPT-3, if you recall, had a context window of only 2,000 tokens. So you had to be very judicious about which examples you picked. But you pick those, and then you send the new query, and GPT-3 would complete it in the DSL that I had designed, which until milliseconds ago it had never seen. And I had no access to the internals of GPT-3, I had no access to the weights, but still it worked. So that's how.
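The pipeline just described can be sketched in a few lines. The example bank and the DSL below are invented stand-ins, and a real deployment would use embedding-based semantic search rather than the cheap word-overlap (Jaccard) similarity used here.

```python
# Toy sketch of the few-shot RAG pipeline described above: search a bank of
# (natural language, DSL) pairs, send the best matches as a prefix, and let
# the model complete the final line in the DSL. All queries and the DSL
# syntax are hypothetical.
EXAMPLES = [
    ("most runs by a batsman in 2019", "STATS(type=runs, year=2019, sort=desc, limit=1)"),
    ("highest score at Lord's",        "STATS(type=score, ground=lords, sort=desc, limit=1)"),
    ("most wickets in World Cups",     "STATS(type=wickets, event=world_cup, sort=desc, limit=1)"),
]

def similarity(a, b):
    # Jaccard word overlap as a cheap stand-in for semantic search.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def build_prompt(query, k=2):
    # Pick the k most similar stored queries, then append the new query
    # with a dangling "A:" for the model to complete.
    ranked = sorted(EXAMPLES, key=lambda ex: similarity(query, ex[0]), reverse=True)
    parts = [f"Q: {q}\nA: {dsl}" for q, dsl in ranked[:k]]
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

print(build_prompt("most runs in World Cups"))
```

With a 2,000-token context window, the `k` examples have to be chosen carefully, which is exactly the "be very judicious" constraint Misra mentions.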
Martin Casado
So it's not obvious to me, given your matrix example of a prompt and then a distribution, how something like in context learning would work. And so I think your first paper tackled this problem, Right? And so maybe you could walk through your understanding of how LLMs do in context learning.
Vishal Misra
Yeah. So think about what in-context learning is: it's that you update as you see evidence. In the first paper, what I also did was I took this cricket DSL example and I depicted the next-token probabilities of the model as it was shown more and more examples. The first time you show it this DSL, the natural language and the DSL, the probabilities of the DSL tokens were extremely low, because GPT-3 had never seen this thing. When it saw the cricket question, in its mind it was trying to continue it with an English answer. So the probabilities that were high were all English words. Once it saw my prompt, where I had the question and the DSL, and then the question in the next row, the probabilities of the DSL tokens started going up. With every example, they went up. And finally, when I gave the new query, it had almost 100% probability of getting the right token. So this is an example of the model updating its posterior probability in real time. It was updating its knowledge: okay, I've seen evidence, this is what I'm supposed to do now. This is a colloquial way of saying what Bayesian inference is. Bayesian updating basically is: you start with a prior; when you see new evidence, you update your posterior. That's the mathematical definition. But in English, it's basically: you see new evidence, you update your belief about what's happening. So it was clear to me that LLMs are doing something which resembles Bayesian updating. So in that first paper, I had this matrix formulation, and I showed that what it's doing looks like Bayesian updating. Then we can come to the sort of next series of papers.
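The dynamic described above can be caricatured as Bayes' rule over two hypotheses about the expected continuation: plain English versus the never-before-seen DSL. The prior and likelihood numbers below are made up; the point is only the shape of the curve, where each (question, DSL) example pushes the posterior toward "answer in DSL".

```python
# Hypothetical illustration: each in-context example is evidence, and the
# posterior probability of "the continuation should be in the DSL" climbs
# with every example, mirroring the rising DSL-token probabilities above.
def bayes_update(p_dsl, lik_dsl=0.9, lik_english=0.1):
    # Made-up likelihoods: how probable each hypothesis makes one observed
    # (question, DSL-answer) example.
    num = p_dsl * lik_dsl
    return num / (num + (1.0 - p_dsl) * lik_english)

p = 0.01  # prior: before any examples, DSL continuations are extremely unlikely
for n in range(1, 6):
    p = bayes_update(p)
    print(n, round(p, 4))  # posterior climbs toward 1 with each example
```

After a handful of updates the posterior is essentially 1, which is the colloquial version of "it had almost 100% probability of getting the right token."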
Martin Casado
That's right. So, okay, I mean, it seemed pretty conclusive to me at that time. And then you went quiet for a while, and then, I still remember the WhatsApp text, you said, Martin, I know exactly how these things are working now. And then, listen, you dropped a series of papers that kind of broke the Internet. Like, you went super viral on Twitter. I mean, people really noticed. And I want to get to that in just a second. But before that, I remember when your first paper came out, people would be like, you know, these things are definitely not Bayesian; anything could be considered to be Bayesian, but they're not. Why do you think there was this reaction? I felt like there was almost kind of a backlash just because they were being characterized as Bayesian.
Vishal Misra
I think in this whole world of probability and machine learning, there have been camps of Bayesians and frequentists. And I don't want to get in the middle of that sort of political battle, but Bayesian has become almost a loaded word; people had a reaction to that. It's part of that war.
Martin Casado
I see. So it's like the old Bayesian frequentist type battle.
Vishal Misra
Yeah. So people just had this, oh no, you can say anything is Bayesian, right? So I said, okay, maybe they have a point. Maybe what we are saying is not really Bayesian. How do we prove that it's Bayesian? So then, first, I have to thank you and Andreessen Horowitz for this. You know, when I said that in my first paper I showed these probabilities, it was because OpenAI had, in its interface, this option to display those probabilities. Then they stopped, so we could not peer inside what's happening. For some reason they stopped, OpenAI. I'm not going to get into the open-and-closed joke, but they stopped. So then we developed our own interface, which lets you look not only at the probabilities, but also the entropy of the next token.
Martin Casado
Was this on top of an open source model?
Vishal Misra
Yeah. So you can load any sort of open-source model, but being in academia, we didn't have access to compute. Thanks to your generous donation, we got the clusters to run what's called TokenProbe. So you can go to tokenprobe.cs.columbia.edu.
Martin Casado
Is it still running?
Vishal Misra
It's still running. It's still running and people come to it. I use it in my classes to get students to do assignments. They write their own DSLs and they say that it really helps them understand how these LLMs work.
Martin Casado
So literally, my understanding of LLMs came from TokenProbe. I'd sit there and just look at the distribution as you filled out a prompt. It's actually very, very enlightening. So for those of you that are listening, what's the URL again?
Vishal Misra
tokenprobe.cs.columbia.edu.
Martin Casado
Yeah, check it out. It's a very, very useful way to actually see how the probability distribution gets updated as you fill out a prompt.
Vishal Misra
Right. But then I cheated a little. You know, it was running, but I also had access to the GPUs that were powering it. And then, along with colleagues at Columbia, one of whom is now at DeepMind, we started to think about how you really prove that it's Bayesian.
Martin Casado
Can you just explain it? Actually, I actually don't know the answer to this. Yeah, it seemed to me you proved it in the first paper. Like what was missing?
Vishal Misra
Well, in the first paper we showed it, but it was empirical, and you could see.
Martin Casado
I see, I see.
Vishal Misra
You could see.
Martin Casado
Not a mathematical proof. Because it was obvious to me that
Vishal Misra
it was even obvious to me. But to convince, you could say, the people who dismissed it with "anything can be Bayesian"...
Martin Casado
I see, I see.
Vishal Misra
We had to show it precisely, mathematically. Got it. So then we came up with this idea, with my colleagues Naman Agarwal and Siddharth Dalal, the series of papers were written with them, of a Bayesian wind tunnel. So what's a wind tunnel? Well, a wind tunnel in the aerospace industry is where you test an aircraft in an isolated environment. You don't fly it; you test it against all sorts of aerodynamic pressures and you see what it will withstand, what kind of altitude, pressure, blah, blah, blah. You don't want to do that testing up in the air. So we said, okay, why don't we create an environment where we take these architectures, and we tested transformers, Mamba, LSTMs, MLPs, all architectures. Take a blank architecture, give it a task where it's impossible for the architecture to memorize what the solution to that task should be. The space is combinatorially impossible given the number of parameters, and we took very small models. So it's difficult enough that they cannot memorize it, but tractable enough that we know precisely what the Bayesian posterior should be; you can calculate it analytically. So we gave these models a bunch of tasks where, again, we show that it's impossible to memorize. We trained these models, and we found that the transformer got the precise Bayesian posterior down to 10 to the power minus 3 bits of accuracy. It was matching the distribution perfectly. So it is actually doing Bayesian updating in the mathematical sense, given a task where it has to update its belief. Mamba also does it reasonably well. LSTMs can do some of the tasks. So in the papers we have a taxonomy of Bayesian tasks. The transformer does everything, Mamba does most of it, LSTMs do only part, and MLPs fail completely.
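The wind-tunnel idea can be illustrated with a stand-in task (not one from the papers): coin flips whose bias is drawn from a known discrete prior. Because the exact Bayesian posterior predictive is computable in closed form, any sequence model's output distribution can be scored against it, for example as a KL divergence measured in bits.

```python
import math

# Hypothetical wind-tunnel task: the coin's bias theta comes from a known
# discrete prior, so the exact Bayesian answer is available analytically
# and a trained model's prediction can be scored against it.
PRIOR = {0.2: 1 / 3, 0.5: 1 / 3, 0.8: 1 / 3}

def posterior_predictive(flips):
    """Exact P(next flip = 1 | flips) under the known prior."""
    post = {}
    for theta, p in PRIOR.items():
        lik = 1.0
        for f in flips:
            lik *= theta if f == 1 else (1.0 - theta)
        post[theta] = p * lik
    z = sum(post.values())
    return sum(theta * w / z for theta, w in post.items())

def kl_bits(p, q):
    """KL divergence between two Bernoulli distributions, in bits."""
    out = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:
            out += a * math.log2(a / b)
    return out

exact = posterior_predictive([1, 1, 0, 1])  # the analytically correct answer
model = exact + 1e-4                        # stand-in for a trained model's output
print(round(exact, 4), f"{kl_bits(exact, model):.2e}")
```

"Matching to 10^-3 bits" is then a statement about this kind of divergence between the model's next-token distribution and the analytically computed posterior, measured on tasks too large to memorize.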
Martin Casado
So is this a reflection of the data that it's trained on, or is it more a reflection of the mechanism?
Vishal Misra
It's the mechanism, it's the architecture. The data decides what tasks it learns. So in the first paper, we had these Bayesian wind tunnels, and we showed that it's doing the job on different tasks. In the second paper, we show why it does it. We look at the transformers, we look at the gradients, and we show how the gradients actually shape this geometry, which enables this Bayesian updating to happen. Then in the third paper, we took these frontier production LLMs which have open weights, so that we could look inside them, and we did our testing, and we saw that the geometries that we saw in the small models persisted in models which are, you know, hundreds of millions of parameters; the same signature existed. The only thing is that because they are trained on all sorts of data, it's a little bit dirty or messy, but you can see the same structure. So the whole idea behind the Bayesian wind tunnel was that, unlike these production LLMs, where you don't know what they have been trained on, so you cannot mathematically compute the posterior, again, how do you prove it? It looks Bayesian, you know, from the first paper it looks Bayesian, but... So the wind tunnel sort of solved that problem for us. We said, okay, let's start with a blank architecture, give it a task where we know what the answer is and it cannot memorize it, and let's see what it does.
Martin Casado
So do you think this provides any sort of, like, indication of how humans think, or do you think that these things are totally independent?
Vishal Misra
No, no, it does provide. Right. So, you know, human beings also update our beliefs as we see new evidence. Right. So we do, in some sense, Bayesian updating, but we do something more than that; I'll come to that. These transformers, or even Mamba, do this Bayesian updating. But the difference with humans is, we'll update our posterior when we see some new evidence, but the way our brains have evolved over hundreds of millions of years, our optimization objective has been: don't die and reproduce.
Martin Casado
Right?
Vishal Misra
That's been sort of the driving force, and our brains have learned to adjust. And so when we see some danger, something rustling in that bush, don't go near, we know how to react to that danger. We know how to save ourselves. We internalize that learning, and our brain cells, our synapses, remain plastic throughout our lifetime. What happens with LLMs is once the training is done, those weights are frozen. When you're doing inference, for instance in-context learning, or anything during that conversation, okay, you're doing Bayesian inference. But then you forget. The next time, a new conversation starts with zero context; you don't retain any learning that happened in the previous instance. So, for instance, with the cricket DSL that I was doing, every invocation of it was fresh. It did not remember what the DSL looked like the last time I sent a query. So that's one difference between how humans use Bayesian updating, which is we remain plastic all our lives, whereas LLMs are frozen. And there's another sort of difference, which, if you want me to get into, tell me.
Martin Casado
Yeah, yeah, yeah, yeah.
Vishal Misra
So the other difference is, well, first, you know, our objective is don't die, reproduce. The LLM's objective is predict the next token as accurately as possible. Right? So all these scary stories that you read about, oh, the LLM tried to deceive, and it tried to prevent itself from being shut down: that's not a function of the architecture, that's a function of the training data. It has been fed articles from Reddit or Asimov or whatever.
Martin Casado
I mean, just today, by the way, Dario allegedly said that you can't rule out that they're conscious.
Vishal Misra
You can rule out that they're conscious. I mean, come on. As I said, you know, Anthropic makes great products. Claude Code is fantastic, Cowork is fantastic. But they are grains of silicon doing matrix multiplication. They don't have consciousness, they don't have an inner monologue. They're not driven by the same objective function, don't die, reproduce. Right? They're driven by: don't make a mistake on the next token. And that's driven entirely by the training data. Right? You train the LLM with stories from Asimov or Reddit where, you know, to survive it's going to do this or that, and it'll reproduce that. So it's a reflection, it's not a mind.
Martin Casado
And the results, just to say it for the 10th time, are perfectly Bayesian.
Vishal Misra
Perfectly, yeah.
Martin Casado
To the digit.
Vishal Misra
To the digit, yeah. I mean, I trained it for 150,000 steps and the accuracy was 10 to the power minus three bits. I could have trained it for more, you know. And this happened in half an hour on the infrastructure that you provided for TokenProbe; in the background I could use those GPUs to train, so thank you again for that. But now, coming back to human beings. We are Bayesian, but we do something else. You know, when I throw this pen at you, what will you do?
Martin Casado
Dodge it.
Vishal Misra
Dodge it?
Martin Casado
Yeah.
Vishal Misra
Why will you dodge it?
Martin Casado
To avoid being hit.
Vishal Misra
Avoid being hit. But your head is not doing a Bayesian calculation of, okay, this pen is coming, the probability that it hits me, it'll cause this much pain, and all that. What you're essentially doing in your head is a simulation. You see the pen coming, your mind simulates that it'll come and hit you, and you dodge it. Right. So all of deep learning is doing correlation, it's not doing causation. Causal models are the ones that are able to do simulations and interventions. So, you know, Judea Pearl has this whole causal hierarchy, where the first level is association, which is where you build these correlation models. Deep learning is beautiful, it's extremely powerful. I mean, you see every day all these models are amazingly good. They do association. The second level in the hierarchy is intervention. Deep learning models do not do that. The third is counterfactuals. So both intervention and counterfactuals, you can imagine, are some sort of simulation. You build a causal model of what's happening, and then you are able to simulate. Our brains do that. The current architectures don't. Another example, which I think will make it clear, is the difference between, I'll use the technical terms, Shannon entropy and Kolmogorov complexity. If you look at the Shannon entropy of the digits of pi, it's infinite. It's impossible to predict and learn what digit will come next. That's the definition of Shannon entropy, and Shannon entropy sort of tries to capture correlation; it tries to learn the correlation. Deep learning does Shannon entropy. Kolmogorov complexity, on the other hand, is the length of the shortest program which will reproduce the string in question. Now, the programs to get the digits of pi are very small. Thanks to Ramanujan, there are all sorts of really small programs that can reproduce it exactly. So the Kolmogorov complexity of pi is very small. 
Shannon entropy is infinite. I think deep learning is still in the Shannon entropy world. It has not crossed over to the Kolmogorov complexity and the causal world.
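The contrast can be made concrete. The digits of pi look statistically patternless, with empirical Shannon entropy per digit near the maximum of log2(10) ≈ 3.32 bits, yet a tiny program generates them exactly, which is the sense in which their Kolmogorov complexity is small. The sketch below uses Gibbons' unbounded spigot algorithm.

```python
import math
from collections import Counter

# Gibbons' unbounded spigot algorithm: a few lines of code that stream the
# digits of pi exactly -- the sense in which pi's Kolmogorov complexity is tiny.
def pi_digits():
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

gen = pi_digits()
digits = [next(gen) for _ in range(2000)]

# Yet the digit stream looks statistically patternless: its empirical Shannon
# entropy comes out close to the maximum of log2(10) bits per digit.
counts = Counter(digits)
entropy = -sum((c / len(digits)) * math.log2(c / len(digits)) for c in counts.values())
print(digits[:8], round(entropy, 3))
```

A predictor that only learns statistics of the digit stream sees near-maximal entropy and can do no better than guessing; the short generating program is the kind of compressed causal representation the statistics alone never reveal.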
Martin Casado
Wow, Interesting.
Vishal Misra
Right? So
Martin Casado
to what extent do you think this provides us research directions to kind of improve the state of the art? So let me just give you a specific example you talked about: unlike human beings, these models don't actually update, you know, the matrix. They don't kind of update their weights. But right now there's a lot of research on continual learning. You know, so does your work provide some guidance on how you might approach those problems? And in particular, I've always had this question, which is: we use so much data and so much compute to create these models. Is it even reasonable to think that you could update the weights and actually have a meaningful impact, you know, in real time? I mean, it just seems like you'd need so much more data in order to do that. So can you start answering these questions?
Vishal Misra
You can start answering some of these questions. And one of the misconceptions that exists today is that scale will solve everything. Scale will not solve everything. You need a different kind of architecture. And this continual learning is a difficult problem. You have to balance learning something new against the risk of catastrophic forgetting. If you update the weights and you forget what was important, what you had already learned, then you're not making progress; it'll just be some sort of random, chaotic model. Solving that problem is difficult. That's one aspect of it. So to get to what is called AGI, I think there are two things that need to happen. One is this plasticity, which has to be implemented through continual learning. Secondly, we have to move from correlation to causation.
Martin Casado
How much is this similar to what Yann LeCun talks about? So, Yann LeCun: causality, planning, predicting how your action would play out.
Vishal Misra
It is related. He's coming at it from a different angle than the Judea Pearl model, but it is related. The other thing is, the first time I came on this podcast, I mentioned this test of AGI, the Einstein test, if you remember. I said, you take an LLM and train it on pre-1916 or pre-1911 physics and see if it can come up with the theory of relativity. If it does, then we have AGI. I mean, it's a high bar, but, you know, we should have high bars. It won't. And this is the same test that I think Demis mentioned at the India AI summit a couple of weeks ago; it created a lot of news. But why? Why is that, and how is it related to this idea of Shannon versus Kolmogorov? So at the time of Einstein, there were a lot of clues that something was missing from Newtonian mechanics. People knew that Mercury's orbit didn't make sense; there was something off about it. Then there were these experiments, the Michelson-Morley experiments, where they were trying to detect this medium called the ether through which light was supposed to travel. They felt that if you bounced light in different directions, the speed might change, and they could detect a change in the speed of light. They tried several experiments, they had really precise instruments which could measure the speed, and they found nothing. The speed of light did not change at all. Then there was the whole issue of black holes, then gravitational lensing. So there were a lot of these signs that Newtonian mechanics was not really explaining everything. But until Einstein came up with a new representation, the space-time continuum, we were stuck. So if you had a model that just looked at correlations and saw all of these pieces of individual evidence put together, it would not have come up with the beautiful equation that Einstein came up with. You know, I'm forgetting exactly what it is, G_μν = 8πT_μν, something like that. 
The equation of the space-time continuum, the tensor formulation. So he came up with a new formulation. He kind of rejected the existing axioms. He came up with a very short Kolmogorov representation of the world: one equation. From that equation, everything else follows. Right? Whether you're talking about gravitational waves or black holes or Mercury, or how GPS works. The GPS that we use every day in our phones uses the equations of relativity.
Martin Casado
So does this end up meaning you almost have to ignore the majority of previous data in order to do it, which LLMs can't, because they're trained on the majority of previous data? You almost have this kind of data gravity that's pulling you back. It's like everybody said it's X, there's a little bit of evidence that it's Y, but because everybody said it's X, the LLM will always say it's X.
Vishal Misra
It'll always say X. It'll treat that Y as an anomaly.
Martin Casado
Actually, this is a very nice way to say it. Okay, now I get your Shannon entropy versus Kolmogorov complexity. One of them is the total amount of information there, and you will always be bound to that total amount of information, which is what happens right now.
Vishal Misra
Yeah.
Martin Casado
Where you can actually describe
Vishal Misra
another.
Martin Casado
Another notion. You can describe everything with a shorter description given the new data, which would be a totally different Kolmogorov complexity.
Vishal Misra
You need a new representation. Right. Yeah.
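The Shannon-versus-Kolmogorov distinction they are circling can be made concrete with a toy sketch (illustrative Python, not from the episode; the string and sizes are made up):

```python
# Toy illustration: data that looks bulky has a tiny generating program.
import zlib

# 10,000 observations that follow a simple hidden rule.
data = "ab" * 5_000

# A "Shannon-style" view stays bound to the data itself: compression
# squeezes out redundancy but still describes the observations directly.
compressed_size = len(zlib.compress(data.encode()))

# A "Kolmogorov-style" view is the shortest *program* that regenerates
# the data -- a new, far shorter representation.
program = '"ab" * 5_000'
assert eval(program) == data

# The program is shorter than any direct description of the data.
print(len(data), compressed_size, len(program))
```

The analogy to the episode: an LLM trained on the raw observations is working at the "compressed data" level, while Einstein's one-equation reformulation is the short program.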
Martin Casado
You know, another way that I've always thought about these, and I thought you articulated it well the last time we talked about it, is that the universe is this very, very complex space. And somehow humans map it into a manifold that's less complex, and that gets written down. It's still a very large space, but it's a bounded space. And the LLMs learn that manifold, and then they kind of use Bayesian inference to move up and down that manifold, but they're bound to that manifold.
Vishal Misra
Yeah.
Martin Casado
And then, again, I don't want to put words in your mouth, but what they can't do is generate a new manifold, which requires understanding the way that the universe works and then coming up with a new representation of the universe.
Vishal Misra
And this is what relativity is, right?
Martin Casado
Yeah. Exactly.
Vishal Misra
Einstein had to create a new manifold. If you just stuck with the old manifold of Newtonian physics, then you would see these correlations, but you could not come up with a manifold that explained them. So you need to come up with a new representation. To me, there are lots of definitions of AGI. The Turing test? We have already passed that. Performing economically useful work? Every day you see LLMs doing that.
Martin Casado
Do we? I don't know.
Vishal Misra
No, I mean they are.
Martin Casado
I mean without human intervention.
Vishal Misra
No, no, no. So that, that's different. But still, you know, it's like a car can run faster than humans, right?
Martin Casado
I mean, yeah, that's a very shallow definition.
Vishal Misra
Yeah.
Martin Casado
So all these definitions do useful, you
Vishal Misra
know, maybe in six months you'll have Claude or Gemini doing, without intervention, coding tasks which are well defined and well scoped. But to me, AGI will happen when these two problems get solved: plasticity, that is, continual learning done properly, and building a causal model in a more data-efficient manner.
Martin Casado
We are hearing people now talking about seeing generality. Like Donald Knuth, for example, in the last few days had this aha moment that kind of went viral on X. So do you think that suggests that we're seeing generality, or.
Vishal Misra
No, no, no. So that actually to me it validates what I've been talking about for a while now.
Martin Casado
How so?
Vishal Misra
So if you read what he did: with the help of a colleague, he got the LLMs to solve this particular problem of finding Hamiltonian cycles for odd numbers, we won't get into that. And he got the LLMs to keep solving for one odd number after the other. What he also did is, after it found a solution for a particular value of M, he made the LLM update its memory with exactly what it learned in solving that problem. So the LLM tried many different things, something worked, update the memory. That's kind of like hacking together plasticity.
Martin Casado
Yeah, right.
Vishal Misra
It's learning from what it has done as it goes along. Again, it's a hacked version of plasticity: you're not changing the weights, you're just improving the context as you learn. And even beyond that, this whole space of Hamiltonian cycles and the associated math is well represented in the manifolds that these LLMs have been trained on. You just had to find the right connection, and LLMs, you know, if you throw enough compute at them, they will find the right connection. So Knuth was able to use the LLMs' attempts, but eventually it needed him to put together what he saw into a solution. They definitely helped him get to the solution, but he had to create the new manifold, so to speak, to come to the solution. The LLMs were, after a while, stuck. Read what he's written; it's hot off the press, I think two days ago. Eventually he used the solutions and he came up with the proof. So it's like Einstein: he saw all this evidence, then he thought about what would explain it, and he came up with a causal model. So Knuth, in his brain, is sort of.
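The "hacked plasticity" loop described here can be sketched in a few lines. This is a hypothetical skeleton, not Knuth's actual workflow: `call_llm` is a stand-in for a real model call, and the lesson strings are placeholders.

```python
# Sketch of solve-then-remember: fold each instance's lesson back into
# the prompt context for the next instance. The model weights never change;
# only the context grows. `call_llm` is a hypothetical stand-in.

def call_llm(prompt: str) -> tuple[str, str]:
    """Stand-in for a frontier-model call: returns (solution, lesson)."""
    return f"solution for: {prompt[-20:]}", "lesson: reuse the last construction"

memory: list[str] = []   # accumulated lessons, carried in context
solutions = {}

for m in [3, 5, 7, 9]:   # successive odd instances of the problem
    prompt = "\n".join(memory) + f"\nSolve the Hamiltonian-cycle instance m={m}"
    solution, lesson = call_llm(prompt)
    solutions[m] = solution
    memory.append(lesson)  # "update its memory with exactly what it learned"

print(len(memory))  # 4 lessons carried forward, all in context, none in weights
```

The design point is exactly what Misra calls out: this is continual learning simulated in the context window rather than in the parameters.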
Martin Casado
The Kolmogorov part is the human.
Vishal Misra
Right. And the LLMs are extremely efficient at doing the Shannon part of it. It found all the solutions by trying, you know, various things and learning more
Martin Casado
and more clever ways to decompose it. I'm wondering, and I'm going to ask the same question again: do you think this provides some insight on the next problem to tackle? Like, is there a mechanism that will get the Kolmogorov complexity, or not?
Vishal Misra
It tells us which direction to pursue,
Martin Casado
but clearly not how to do it.
Vishal Misra
Like not how to do it. But even Kolmogorov complexity has largely remained sort of a theoretical construct.
Martin Casado
Yeah, for sure. There's no algorithm, there's no.
Vishal Misra
There haven't been practical implementations of finding the shortest program. We know it exists; you can argue about it. So that's where, and I admit it's my bias, that's where our energy should be focused, not larger models with more tokens.
Martin Casado
And can you tie the two things? Like how does that pair with doing simulation? Or is that simulation totally orthogonal?
Vishal Misra
No, simulation is related. Right.
Martin Casado
So you think it, like, basically you do simulation and somehow that is a step towards doing the Kolmogorov complexity.
Vishal Misra
The simulator is the program that we create. It may not be the perfect program.
Martin Casado
Oh, I see.
Vishal Misra
But in our heads, we create this simulator that when I'm throwing the pen, you know that it's coming at you.
Podcast Host
Right.
Vishal Misra
And you duck. So you're not computing the probabilities as it goes, but you have, you know,
Martin Casado
you build an accurate simulation, versus, we are talking more conceptually.
Vishal Misra
Conceptually. But it's the same mechanism.
Martin Casado
And you think those are the same mechanisms.
Vishal Misra
It's the same mechanism, really. Yeah. You have to build a causal model.
Martin Casado
Right, I see.
Vishal Misra
For most things. Right. So you have to move from Correlation to causation. I mean, we've heard this term
Martin Casado
ad
Vishal Misra
infinitum, but here it's making a difference in the way we view intelligence.
Martin Casado
How have the last three papers been received?
Vishal Misra
No, I don't know. Well, the arXiv versions, let me tell
Martin Casado
you, a lot of great reception, A lot of people read it. I'm just wondering what kind of feedback that you've got.
Vishal Misra
I'm getting good feedback, but I'm an outsider in this field.
Martin Casado
Networking guy.
Vishal Misra
I'm a networking guy. Why is he writing about learning and machine learning and deep learning and Bayesian inference? But from people who have actually taken the time to read those papers, I'm getting really good feedback. There was a recent paper by Google Research which tried to teach LLMs, through some sort of RLHF, to do Bayesian learning properly. That's going in this direction. I think people are coming around to the view that, okay, LLMs are doing Bayesian learning. I know that some people also looked at the Bayesian wind tunnel paper, the arXiv version, and they reproduced the experiments. That's great. They just saw what was written, and they did the training, and they saw: yeah, this is actually happening. So that's great.
Martin Casado
So what's next?
Vishal Misra
What's next is, you know, these two parallel tracks. I hope to make progress there. Plasticity and causality.
Martin Casado
Because to date you've taken an existing mechanism.
Vishal Misra
Yeah.
Martin Casado
And you've created a formal model how it works.
Vishal Misra
Yeah.
Martin Casado
And so now you're actually interested in improving, creating a new mechanism.
Vishal Misra
Yeah, yeah.
Martin Casado
And do you think it's an entirely different architecture, or do you think LLMs are part of the solution?
Vishal Misra
I think LLMs are definitely part of the solution. But there has to be something more. So, you know, I was not interested in sort of cataloging what all these LLMs can do.
Martin Casado
Yeah.
Vishal Misra
I was more interested in why and how they are doing it. I think now we have a good grip on the why and the how, and the next step is to move them to the next level. We have a fairly good understanding of what the limits are. Now, how do you go to the next step?
Martin Casado
Is there an equivalent kind of theoretical framework for causality that applies here, like, similar to, like Bayesian for inference?
Vishal Misra
Well, the. Judea Pearl's whole causal hierarchy, I think.
Martin Casado
I think that's the right one.
Vishal Misra
That's a very good one. The whole do-calculus approach, I think, is a good way to think about it: the ladder of association, intervention, counterfactuals. It takes you from correlation to causation in a mathematical way.
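Pearl's association-versus-intervention distinction can be shown with a toy structural causal model (the structural equations here are made up for illustration): a confounder Z drives both X and Y, and X has no causal effect on Y, yet X and Y are perfectly correlated.

```python
# Toy SCM: Z -> X and Z -> Y, no arrow X -> Y.
# Rung 1 (association) and rung 2 (intervention) of Pearl's ladder
# give different answers, which is the whole point of do-calculus.

def f_x(z): return z      # X := Z  (observational mechanism)
def f_y(z): return z      # Y := Z  (X does not cause Y)

# Rung 1, association: P(Y=1 | X=1), enumerating Z in {0, 1}.
worlds = [(z, f_x(z), f_y(z)) for z in (0, 1)]
p_y_given_x1 = (
    sum(1 for z, x, y in worlds if x == 1 and y == 1)
    / sum(1 for z, x, y in worlds if x == 1)
)

# Rung 2, intervention: do(X=1) cuts the Z -> X edge; Y still follows Z.
worlds_do = [(z, 1, f_y(z)) for z in (0, 1)]
p_y_do_x1 = sum(y for _, _, y in worlds_do) / len(worlds_do)

print(p_y_given_x1, p_y_do_x1)  # 1.0 vs 0.5: correlation is not causation
```

A model that only learns P(Y | X) from data would predict 1.0 and be wrong about the effect of actually setting X, which is the gap Misra says LLMs need to close.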
Martin Casado
That's great. All right, well listen, really appreciate you coming. This is awesome. So we had you here for the first paper where you had the empirical results. Then we had you back when you actually have like the formal proof. And hopefully the next time you come back you will have a proposal for the mechanism that actually provides the next step.
Vishal Misra
Hopefully. Yeah.
Martin Casado
All right. We're working on it.
Vishal Misra
Thank you for having me.
Podcast Host
Thanks for listening to this episode of the A16Z podcast. If you like this episode, be sure to like, comment, subscribe, leave us a rating or review and share it with your friends and family. For more episodes, go to YouTube, Apple Podcasts and Spotify. Follow us on X @a16z and subscribe to our Substack at a16z.substack.com. Thanks again for listening and I'll see you in the next episode. As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any A16Z fund. Please note that A16Z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com/disclosures.
Date: March 17, 2026
Host: Andreessen Horowitz
Guests: Vishal Misra (Columbia University), Martin Casado (a16z)
This episode dives deep into the actual mechanisms behind large language models (LLMs), specifically focusing on what differentiates current LLM capabilities from true Artificial General Intelligence (AGI). Vishal Misra, Vice Dean of Computing and AI at Columbia University, shares insights from his highly-cited research that formally models how LLMs learn, why they're impressive pattern-matchers, and crucially, what they're still missing. The discussion explores Bayesian inference, the limits of current architectures, the gap from correlation to causation, and what it will take architecturally to make the leap to AGI.
Background on Misra’s Work
Matrix Model Explained
Bayesian Updating in LLMs
Skepticism & Pushback
Bayesian Wind Tunnel Methodology
On LLM architecture’s limitations:
“They are grains of silicon doing matrix multiplication. They don't have consciousness. They don't have an inner monologue.” — Vishal Misra [26:04]
On empirical certainty of Bayesian processing:
“The results, just to say it for the 10th time, are perfectly Bayesian...to the digit.” — Martin Casado & Vishal Misra [26:48–26:57]
On the core test for AGI:
“Take an LLM and train it on pre-1916…physics and see if it can come up with the theory of relativity. If it does, then we have AGI.” — Vishal Misra [00:00], [32:13]
On what’s next:
"What's next is, you know, these two parallel tracks. I hope to make progress there. Plasticity and causality.” — Vishal Misra [44:28]
| Capability | LLMs Today | Human/AGI Benchmark | What's Missing |
|---|---|---|---|
| Correlation | Excellent | Good | Already matched |
| Causal Reasoning | Weak/None | Robust | Model of the world, simulation |
| Continual Learning | Context; not persistent | Lifelong, plastic | Persistent architectural plasticity |
| Paradigm Shifts | Cannot generate new manifolds | Can | "Einstein-level" causal leaps |
| Objective Function | Next-token prediction | Survival, invention | Purpose beyond data copying |
| Consciousness | None | Present (subjective) | Irrelevant for current LLMs |
Vishal Misra's research, as thoroughly detailed in this conversation, marks a pivotal advance in understanding the mechanics—and limitations—of today’s LLMs. While transformers are mathematically optimal Bayesian updaters, true AGI will require not just scale, but fundamentally new approaches: continual/plastic learning and architectures that grasp causality rather than just correlation. The episode ends with both hope and humility for AI's future: the next frontier is causal, not merely statistical.