Podcast Summary: What’s Missing Between LLMs and AGI — Vishal Misra & Martin Casado
AI + a16z | March 17, 2026
Host: a16z
Guests: Vishal Misra (Professor and Vice Dean of Computing and AI, Columbia University), Martin Casado (General Partner, a16z)
Episode Overview
This episode dives deep into the fundamental mechanics of large language models (LLMs), examining how they work, their limitations, and what’s missing between current LLM technology and "true" Artificial General Intelligence (AGI). Vishal Misra shares his research journey, spanning mathematical modeling, the Bayesian nature of transformers, and the difference between correlation and causation in AI. The discussion is technical but peppered with anecdotes, analogies, and frameworks that clarify both the state of the art and the path forward.
Key Discussion Points and Insights
1. How LLMs Work: The Matrix Model
- Prompt Distribution Matrix: Vishal uses the abstraction of a giant, sparse matrix in which each row corresponds to a prompt and the entries of that row form a probability distribution over the next token.
"Imagine this huge gigantic matrix where every row of the matrix corresponds to a prompt… columns are a distribution over the vocabulary." – Vishal Misra [04:00]
- Operational Mechanism: An LLM approximates the true next-token distribution for any prompt. The matrix itself is far too large and sparse to store explicitly, so the model's internal representations compress it and generalize across rows — including rows (prompts) it has never seen.
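The matrix abstraction can be made concrete with a toy lookup table. This is a minimal sketch, not how any real model stores its knowledge: the vocabulary, prompts, and probabilities below are all invented, and the key point — that a real LLM generalizes to rows the table never stored — is exactly what a literal table cannot do.

```python
# Toy "prompt -> next-token distribution" matrix. Each key is a row
# (a prompt); each value is that row's distribution over the vocabulary.
# All prompts and probabilities here are made up for illustration.
VOCAB = ["runs", "wickets", "overs", "<eos>"]

MATRIX = {
    "How many":      {"runs": 0.5, "wickets": 0.3, "overs": 0.2, "<eos>": 0.0},
    "How many runs": {"runs": 0.0, "wickets": 0.1, "overs": 0.1, "<eos>": 0.8},
}

def next_token_distribution(prompt):
    """Look up the row for a prompt. A real LLM *generalizes* over rows
    it has never stored explicitly; a table can only look them up."""
    return MATRIX[prompt]

def greedy_next_token(prompt):
    """Pick the highest-probability next token from the prompt's row."""
    dist = next_token_distribution(prompt)
    return max(dist, key=dist.get)
```

Sampling from the distribution instead of taking the argmax corresponds to temperature-style decoding; either way, generation is just repeated row lookups on a matrix the model has learned to compress.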
2. In-Context Learning and Early Innovations
- Empirical Demonstration: Vishal describes how he got GPT-3 (in 2020) to translate natural language cricket queries into a domain-specific language (DSL) that GPT-3 had never seen — an early, practical instance of what is now called retrieval-augmented generation (RAG).
"When a new query came in... GPT3 would complete it in the DSL that I had designed, which until milliseconds ago it had never seen. And I had no access to internals of GPT3, I had no access to the weights, but still it worked." – Vishal Misra [11:26]
- In-Context Learning Mechanics: The in-context examples update the model’s next-token probabilities in real time, which Vishal ties to Bayesian updating.
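The setup described above can be sketched as simple prompt construction: retrieved (query, DSL) example pairs are prepended so the model completes a new query in a DSL it was never trained on. Everything below is hypothetical — the DSL syntax and example queries are invented, not the ones from Vishal's system.

```python
# Hypothetical few-shot prompt for natural-language -> DSL translation.
# The DSL syntax and example pairs are invented for illustration.
EXAMPLES = [
    ("runs scored by Tendulkar in 2003",
     "SELECT(runs, player=Tendulkar, year=2003)"),
    ("wickets taken by Warne at Lord's",
     "SELECT(wickets, player=Warne, venue=Lords)"),
]

def build_prompt(new_query):
    """Prepend retrieved examples, then leave the answer slot open for
    the model to complete — the essence of in-context learning / RAG."""
    lines = [f"Q: {q}\nA: {dsl}" for q, dsl in EXAMPLES]
    lines.append(f"Q: {new_query}\nA:")
    return "\n".join(lines)

prompt = build_prompt("centuries by Lara in 1999")
```

No weights change anywhere in this process; the examples shift the model's next-token probabilities only for the duration of the context window, which is the point Vishal ties to Bayesian updating in the next section.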
3. Bayesian Updating: Empirical and Mathematical Foundations
- Initial Empirical Evidence: Vishal's early work showed that as LLMs are exposed to new in-context evidence, their output probabilities shift in a way that looks Bayesian.
"LLMs are doing something which resembles Bayesian updating." – Vishal Misra [13:38]
- Pushback from the Field: Some objected, citing the "Bayesian vs. frequentist" schism in ML, and the claim that “anything can be said to be Bayesian.”
- The Bayesian Wind Tunnel: To settle the debate, Vishal and his colleagues developed benchmark tasks (“wind tunnels”) where a small, blank architecture can’t memorize solutions but the true Bayesian posterior is known analytically.
- In controlled settings, transformers matched the Bayesian posterior to within 10⁻³ bits of accuracy.
- Key Finding: This Bayesian behavior is due to the architecture, not merely data. Other architectures (like LSTMs, MLPs) proved less effective.
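The "wind tunnel" idea hinges on tasks where the exact Bayesian posterior is known in closed form, so a model's next-token probabilities can be scored against it. A minimal sketch of that flavor, assuming the simplest such task (a Bernoulli source with a Beta prior — toy numbers, not the paper's actual benchmarks):

```python
from math import isclose

# For a Bernoulli source with a Beta(a, b) prior, the exact posterior
# predictive after observing h ones and t zeros is (a+h) / (a+b+h+t).
# In a "wind tunnel" experiment, a trained model's next-token probability
# would be compared against this known analytic value.
def posterior_predictive(a, b, observations):
    h = sum(observations)
    t = len(observations) - h
    return (a + h) / (a + b + h + t)

# Token-by-token (sequential) updating must agree with batch updating --
# the coherence property a genuinely Bayesian in-context learner exhibits.
def sequential_predictive(a, b, observations):
    for x in observations:
        a, b = a + x, b + (1 - x)
    return a / (a + b)

obs = [1, 1, 0, 1]
assert isclose(posterior_predictive(1, 1, obs), sequential_predictive(1, 1, obs))
```

Because the architecture is small and the task family is generated fresh, the model cannot memorize answers; matching the analytic posterior this closely is evidence of mechanism, which is the "architecture, not data" point above.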
4. Limits of Human-like Intelligence in LLMs
- Plasticity and Continual Learning: Humans’ synapses are plastic — they can “keep learning” for life. LLMs, by contrast, have fixed weights after training; in-context learning is forgotten after the session.
"What happens with LLMs is once the training is done, those weights are frozen... you forget, the next time a new conversation starts with zero context" – Vishal Misra [24:01]
- Objective Function Differences:
- Humans evolved to "not die and reproduce."
- LLMs aim to optimize next-token prediction.
"They're not driven by the same objective function. Don't die, reproduce. They're driven by don't make a mistake on the next token." – Vishal Misra [26:04]
5. Correlation vs. Causation – The Real Boundary
- Shannon vs. Kolmogorov Complexity:
- Shannon entropy (deep learning excels at) captures unpredictable, random-seeming information.
- Kolmogorov complexity represents the shortest program to reproduce a sequence (e.g., the digits of Pi).
- Deep learning models are locked into learning correlations (Shannon world) and can’t discover causal, generative models (Kolmogorov world).
"Deep learning is still in the Shannon entropy world. It has not crossed over to the Kolmogorov complexity and the causal world." – Vishal Misra [29:38]
- Causal Reasoning as Simulation:
- Humans dodge a thrown pen not by statistical inference but by simulating consequences.
“You're not computing the probabilities... but your mind simulates and you dodge it.” – Vishal Misra [27:29]
- Judea Pearl's causal hierarchy (association → intervention → counterfactual) is cited as the theoretical framework missing in LLMs.
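The gap between Pearl's first two rungs can be shown with a toy structural causal model of my own construction (not one discussed in the episode): a hidden confounder drives both X and Y, so the association P(Y=1 | X=1) and the intervention P(Y=1 | do(X=1)) disagree, and a purely correlational learner can only ever see the former.

```python
# Toy SCM: hidden confounder Z -> X and Z -> Y, with NO causal arrow
# X -> Y. Here X = Z and Y = Z, and Z ~ Bernoulli(p_z).
def observational(p_z=0.5):
    """P(Y=1 | X=1): rung 1 (association). Conditioning on X=1 reveals
    Z=1, which forces Y=1 -- a correlation with no causation behind it."""
    num = den = 0.0
    for z in (0, 1):
        pz = p_z if z == 1 else 1 - p_z
        x, y = z, z
        if x == 1:
            den += pz
            num += pz * (y == 1)
    return num / den

def interventional(p_z=0.5):
    """P(Y=1 | do(X=1)): rung 2 (intervention). do(X=1) severs the
    Z -> X edge; Y still equals Z, so forcing X changes nothing."""
    return sum((p_z if z == 1 else 1 - p_z) * (z == 1) for z in (0, 1))
```

With p_z = 0.5 the observational quantity is 1.0 while the interventional one is 0.5. No amount of passively observed (X, Y) data distinguishes the two; only knowledge of the generative structure — the Kolmogorov-world program — does.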
6. AGI: What Would It Take?
- Plasticity (Continual Learning) and Causal Modeling are required:
- Continual learning must avoid catastrophic forgetting.
- Causation requires new architectures or significant modifications to current ones.
"To get to AGI, I think there are two things that need to happen. One is this plasticity... Secondly, we have to move from correlation to causation." – Vishal Misra [31:04]
- Einstein Test for AGI:
- If an LLM trained only on pre-1916 physics could derive the theory of relativity, "then we have AGI."
- Current models only repeat (correlate) existing data; they can’t invent fundamentally new frameworks.
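The catastrophic-forgetting requirement above can be illustrated with the smallest possible learner — a single weight fit by SGD, an example of my own devising rather than anything from the episode. Trained on task A and then naively on task B, the weight overwrites everything it knew about A.

```python
# Catastrophic forgetting in miniature: one weight w, squared-error loss.
# Task A is y = 2x; task B is y = -2x. Sequential naive SGD on B
# destroys the solution to A -- the failure continual learning must avoid.
def sgd(w, data, lr=0.1, epochs=100):
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x   # gradient of (w*x - y)**2
    return w

task_a = [(1.0, 2.0), (2.0, 4.0)]    # consistent with w = 2
task_b = [(1.0, -2.0), (2.0, -4.0)]  # consistent with w = -2

def error(w, data):
    return sum((w * x - y) ** 2 for x, y in data)

w = sgd(0.0, task_a)   # converges to w ≈ 2
w = sgd(w, task_b)     # converges to w ≈ -2: task A is forgotten
```

After the second phase, the error on task A is large even though it was once solved exactly — the single-parameter analogue of frozen-then-overwritten weights, and the problem that replay, regularization, or genuinely plastic architectures would need to solve.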
7. Recent Noteworthy Developments & Community Reaction
- Donald Knuth's Viral LLM Use: Knuth used LLMs to solve math problems by updating their context, mirroring a form of "hacked plasticity," but the final leap required human causal reasoning.
"Eventually he used the solutions and he came up with the proof.... The LLMs were after a while, stuck." – Vishal Misra [40:54]
- Field Reception:
- Vishal is an outsider (networking background) but has garnered positive feedback, with independent replications of his Bayesian wind tunnel results and ongoing interest from major ML labs.
8. Next Directions
- Plasticity and Causality — The Two Parallel Tracks:
"What's next is, you know, these two parallel tracks. I hope to make progress there. Plasticity and causality." – Vishal Misra [44:28]
- LLMs may be part of the eventual solution, but fundamentally new mechanisms and architectures are required.
Notable Quotes & Memorable Moments (with Timestamps)
- "Pattern matching is not intelligence. LLMs learn correlation, they don't build models of cause and effect." – Podcast Host [01:01]
- "I was amazed that it worked. I wanted to understand how it worked." – Vishal Misra [02:44]
- "The number of rows in this matrix is more than the number of electrons across all galaxies." – Vishal Misra [06:22]
- "The difference with humans is ... our synapses remain plastic throughout our lifetime. What happens with LLMs is once the training is done, those weights are frozen." – Vishal Misra [24:01]
- "Deep learning is still in the Shannon entropy world. It has not crossed over to the Kolmogorov complexity and the causal world." – Vishal Misra [29:38]
- "You take an LLM and train it on pre-1916 or 1911 physics and see if it can come up with the theory of relativity. If it does, then we have AGI." – Vishal Misra [32:13]
- "To me, AGI will happen when these two problems get solved. Plasticity, continual learning properly and building a causal model in a more data efficient manner." – Vishal Misra [38:03]
Important Segments & Timestamps
- LLM Matrix Model – [03:37–07:53]
- In-Context Learning Example & RAG – [08:44–13:38]
- Empirical Bayesian vs Formal Proof – [15:55–21:18]
- Bayesian Wind Tunnel Experiments – [19:07–21:18]
- Human vs. LLM Learning/Plasticity – [23:06–25:21]
- Shannon Entropy vs. Kolmogorov Complexity – [27:29–30:14]
- The Einstein AGI Test – [32:13–36:15]
- Donald Knuth LLM Use Case – [38:34–41:13]
- Research Directions: Plasticity and Causality – [44:28–45:28]
Summary Table: LLMs vs. Human Intelligence
| Aspect | LLMs (Transformers) | Humans |
|----------------|---------------------------------------------|-----------------------------------------|
| Memory | Fixed weights after training | Continually plastic, lifelong learning |
| Objective | Minimize next-token prediction error | Survival and reproduction |
| Reasoning Type | Bayesian updating (correlation/statistics) | Bayesian + causal modeling (simulation) |
| Creativity | Bounded by training data/manifold | Can invent new manifolds (frameworks) |
| Example | Can find correlations, not new physics | Einstein: invented relativity |
Conclusion
The conversation clearly articulates why LLMs, despite their astonishing progress and utility, are fundamentally limited. They are supreme correlators and Bayesian updaters but lack long-term plasticity and causal reasoning — the “missing pieces” to AGI. Progress will require not just scaling LLMs, but fundamentally new architectures that are both plastic and able to discover (not just recapitulate) causal models of reality.
Explore further:
- Vishal’s TokenProbe: tokenprobe.cs.columbia.edu
- Papers referenced (search "Bayesian Wind Tunnel" by Misra et al.)
