The a16z Show: What's Missing Between LLMs and AGI – Vishal Misra & Martin Casado
Date: March 17, 2026
Host: Andreessen Horowitz
Guests: Vishal Misra (Columbia University), Martin Casado (a16z)
Episode Overview
This episode dives deep into the actual mechanisms behind large language models (LLMs), specifically focusing on what differentiates current LLM capabilities from true Artificial General Intelligence (AGI). Vishal Misra, Vice Dean of Computing and AI at Columbia University, shares insights from his highly cited research that formally models how LLMs learn, why they're impressive pattern-matchers, and, crucially, what they're still missing. The discussion explores Bayesian inference, the limits of current architectures, the gap from correlation to causation, and what it will take architecturally to make the leap to AGI.
Key Discussion Points & Insights
1. How LLMs Work: Matrix Models and Bayesian Updating
Background on Misra’s Work
- Five years ago, Misra got GPT-3 to translate natural language into a new domain-specific language (DSL) for querying cricket stats—a system deployed at ESPN. This early experience drove his desire to mathematically model LLM behavior.
- Misra's abstraction: LLMs as an enormous sparse matrix, each row representing a prompt, each column a probability distribution over the vocabulary for the next token ([03:50]).
Matrix Model Explained
- Each LLM prompt corresponds to a probability distribution for the next word/token: “So you imagine this huge gigantic matrix where every row...corresponds to a prompt…columns are a distribution over the vocabulary.” — Vishal Misra [03:50]
- Given the vastness of language, the matrix is intensely sparse, but LLMs function as “compressed representations” of such a matrix, approximating the correct distribution for any prompt.
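This abstraction can be sketched in a few lines (an illustrative toy of my own, not Misra's actual formalism): the "matrix" as a sparse mapping from prompt strings to empirical next-token distributions.

```python
# Toy sketch of the matrix view of an LLM: each prompt (row) maps to a
# probability distribution over next tokens (columns). Illustrative only.
from collections import Counter

corpus = [
    ("the cat sat on the", "mat"),
    ("the cat sat on the", "mat"),
    ("the cat sat on the", "sofa"),
    ("the dog slept on the", "rug"),
]

# Build the (extremely sparse) prompt -> next-token-counts "matrix".
matrix = {}
for prompt, nxt in corpus:
    matrix.setdefault(prompt, Counter())[nxt] += 1

def next_token_distribution(prompt):
    """Return the normalized next-token distribution for a known prompt."""
    counts = matrix[prompt]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

print(next_token_distribution("the cat sat on the"))
# e.g. {'mat': 0.666..., 'sofa': 0.333...}
```

A real LLM, on Misra's view, is a compressed function approximating this mapping for prompts it has never seen, rather than a literal lookup table.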
Bayesian Updating in LLMs
- LLMs update their posterior belief (distribution) with new context/examples provided in prompt, akin to Bayesian inference.
- Empirically observed with the cricket DSL task: “With every example [GPT-3] saw, probability of the DSL tokens went up. When I gave a new query, it had almost 100% probability of getting the right token.” — Vishal Misra [12:58]
2. Empirical to Formal Proof: The ‘Bayesian Wind Tunnel’
Skepticism & Pushback
- Early papers observed Bayesian-like behavior but faced academic skepticism ("anything could be considered Bayes"). To prove it formally, the team devised the “Bayesian wind tunnel” ([19:07]).
Bayesian Wind Tunnel Methodology
- Created tasks small enough that the correct Bayesian posterior is analytically calculable but too large for a small model to memorize.
- Trained fresh, randomly initialized architectures (transformers, Mamba, LSTMs, MLPs) on these tasks.
- Result: transformers (and, to a lesser extent, Mamba) precisely matched the theoretical Bayesian posterior, to within 10⁻³ bits ([20:32]).
Quote:
- “We trained these models and we found that the transformer got the precise Bayesian posterior down to 10 to the power minus 3 bits accuracy. It was matching the distribution perfectly.” — Vishal Misra [20:32]
Significance:
- It's the architecture, not the data, that enables this: transformers learn to implement Bayesian inference internally ([21:27]).
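A "match to 10⁻³ bits" can be quantified as a KL divergence, in bits, between the analytic posterior and the model's output distribution. The sketch below uses hypothetical numbers, not the paper's data, to show the measurement.

```python
import math

# Measuring how closely a model's distribution matches an analytic
# Bayesian posterior: KL divergence in bits. Illustrative numbers only.
def kl_bits(p, q):
    """KL(p || q) in bits, over a shared support."""
    return sum(pi * math.log2(pi / q[tok]) for tok, pi in p.items() if pi > 0)

analytic = {"a": 0.7, "b": 0.2, "c": 0.1}        # exact Bayesian posterior
model    = {"a": 0.699, "b": 0.201, "c": 0.100}  # hypothetical model output

print(f"KL = {kl_bits(analytic, model):.6f} bits")  # tiny: near-perfect match
```

A divergence below 10⁻³ bits means the model's next-token distribution is, for practical purposes, indistinguishable from the exact posterior.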
3. Limits of LLMs vs. Human Intelligence
Plasticity and Continual Learning
- Unlike humans—whose brains are “plastic,” constantly updating with experience—an LLM's weights are frozen after training; all “in-context learning” is ephemeral and doesn’t persist beyond the session ([24:02]).
- Quote: “Once the training is done, [LLM] weights are frozen. When you’re doing an inference, OK, you’re doing Bayesian inference...But then you forget; the next time, a new conversation starts with zero context.” — Vishal Misra [24:02]
Objective Function Differences
- Human brains: optimization objective = survive and reproduce.
- LLMs: objective = predict the next token correctly.
- Seemingly scary “emergent” behaviors in LLMs are just reflections of the training data, not signs of any inner drive ([25:56]).
No Consciousness or Inner Monologue
- Responding to industry claims about possible LLM “consciousness”:
“They are grains of silicon doing matrix multiplication. They don't have consciousness. They don't have an inner monologue.” — Vishal Misra [26:04]
4. Correlation vs. Causation: The Missing Leap to AGI
- LLMs are powerful correlation engines; they excel at association (first level of Judea Pearl’s causal hierarchy), not causation ([30:14]).
- Humans, by contrast, can imagine/simulate (counterfactuals, interventions):
“All of deep learning is doing correlations, it’s not doing causation...Causal models are the ones that are able to do simulations and interventions.” — Vishal Misra [29:32]
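Pearl's distinction between association and intervention can be shown with a toy structural causal model (my illustration, not one discussed in the episode): a confounder makes the observed correlation P(Y|X) wildly different from the interventional P(Y|do(X)).

```python
import random
random.seed(0)

# Toy structural causal model with a confounder Z -> X and Z -> Y.
# X has NO causal effect on Y, yet X and Y are strongly correlated.
def sample(do_x=None):
    z = random.random() < 0.5
    x = z if do_x is None else do_x     # do(X) severs Z's influence on X
    y = z or (random.random() < 0.1)    # Y depends on Z (and noise), not X
    return x, y

N = 100_000
obs = [sample() for _ in range(N)]
p_y_given_x1 = sum(y for x, y in obs if x) / max(1, sum(x for x, _ in obs))
p_y_do_x1 = sum(y for _, y in (sample(do_x=True) for _ in range(N))) / N

print(f"P(Y=1 | X=1)     = {p_y_given_x1:.2f}")  # high: pure correlation
print(f"P(Y=1 | do(X=1)) = {p_y_do_x1:.2f}")     # lower: no causal effect
```

A pure correlation engine can only estimate the first quantity; answering the second requires a model of the mechanism that generated the data, which is exactly the gap Misra points to.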
Shannon Entropy vs. Kolmogorov Complexity
- LLMs optimize for Shannon entropy (correlation-based compression); AGI requires crossing to Kolmogorov complexity (the shortest program, i.e., a causal model). Example: the digits of π look like maximum-entropy noise to a statistical learner, yet humans recognize their low Kolmogorov complexity: a simple program computes them ([29:33]).
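The π example can be made concrete: the digit stream passes statistical randomness tests, yet a short program (here, Machin's formula with big integers; a standard construction, not one from the episode) generates as many digits as you like.

```python
# The digits of pi have near-maximal Shannon entropy as a symbol stream,
# but low Kolmogorov complexity: this short program produces them all.
def pi_digits(n):
    """First n decimal digits of pi via Machin's formula and big ints."""
    scale = 10 ** (n + 10)  # 10 guard digits against truncation error
    def arctan_inv(x):      # arctan(1/x) * scale, by the Taylor series
        total = term = scale // x
        k, sign = 3, -1
        while term:
            term = scale // (x ** k)
            total += sign * (term // k)
            k, sign = k + 2, -sign
        return total
    # Machin: pi = 16*arctan(1/5) - 4*arctan(1/239)
    pi_scaled = 4 * (4 * arctan_inv(5) - arctan_inv(239))
    return str(pi_scaled)[:n]

print(pi_digits(10))  # '3141592653'
```

An entropy-minimizing learner sees only the irreducible-looking surface statistics; recognizing that this short generator exists is a Kolmogorov-style, causal-model insight.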
5. Why Scaling Up Isn’t Enough
- “One of the misconceptions that exists today is that scale will solve everything. Scale will not solve everything. You need a different kind of architecture.” — Vishal Misra [31:04]
- Two critical unsolved pieces:
- Implementing Plasticity: Making learning updates persistent, not just ephemeral.
- Moving from Correlation to Causation: Building models capable of simulating causal relations.
6. The Einstein/Relativity Test for AGI
- AGI needs to infer new representations (manifolds), not just fit past data:
“Take an LLM and train it on pre-1916…physics and see if it can come up with the theory of relativity. If it does, then we have AGI.” — Vishal Misra ([00:00], [32:13])
- LLMs, bound by the statistical majority, can't "jump" to new paradigms like Einstein did; they treat outliers as anomalies, not seeds for new theories ([35:55]).
7. Current Research Frontiers & Directions
- Continual/Plastic Learning: Balancing learning new things without catastrophic forgetting is a major research challenge ([31:04]).
- Simulation & Causality: Next advances may come from architectures or systems that encode causal/simulative reasoning (reference to Judea Pearl’s “do-calculus”) ([45:38]).
- Kolmogorov Complexity as Target: Instead of scaling up LLMs, focus on algorithms that can distill knowledge to its causal essence/shortest program ([41:48]).
- LLMs as Substrate, But Not the Whole Solution: Misra sees transformer LLMs as a piece of AGI, but something more is required.
8. Notable Real-World Explorations
- Donald Knuth’s LLM Experiment: using an LLM to solve incrementally harder mathematical tasks while updating its own memory as it progressed. This shows generalization, but the “plasticity” was bolted on externally, and human intervention was still needed for the final, causal leaps ([38:38]).
Timestamped Memorable Quotes & Moments
- On LLM architecture’s limitations:
  “They are grains of silicon doing matrix multiplication. They don't have consciousness. They don't have an inner monologue.” — Vishal Misra [26:04]
- On empirical certainty of Bayesian processing:
  “The results, just to say it for the 10th time, are perfectly Bayesian...to the digit.” — Martin Casado & Vishal Misra [26:48–26:57]
- On the core test for AGI:
  “Take an LLM and train it on pre-1916…physics and see if it can come up with the theory of relativity. If it does, then we have AGI.” — Vishal Misra [00:00], [32:13]
- On what’s next:
  “What's next is, you know, these two parallel tracks. I hope to make progress there. Plasticity and causality.” — Vishal Misra [44:28]
Important Segment Timestamps
- [03:50] LLMs as sparse matrices—fundamental abstraction
- [12:58] Bayesian updating illustrated with in-context cricket DSL experiment
- [19:07] “Bayesian wind tunnel” methodology and its implications
- [24:02] Limits of in-context learning—plasticity vs. frozen weights
- [29:32] Correlation vs. causation, and the need for simulation
- [32:13] The Einstein/Relativity AGI Test
- [41:13] Kolmogorov complexity, theoretical leaps, and future directions
- [45:38] Judea Pearl and the causal hierarchy as a research blueprint
Summary Table: LLMs and AGI – What’s Missing
| Capability | LLMs Today | Human/AGI Benchmark | What's Missing |
|---|---|---|---|
| Correlation | Excellent | Good | Already matched |
| Causal Reasoning | Weak/None | Robust | Model of the world, simulation |
| Continual Learning | Context; not persistent | Lifelong, plastic | Persistent architectural plasticity |
| Paradigm Shifts | Cannot generate new manifolds | Can | “Einstein-level” causal leaps |
| Objective Function | Next-token prediction | Survival, invention | Purpose beyond data copying |
| Consciousness | None | Present (subjective) | Irrelevant for current LLMs |
Closing Note
Vishal Misra's research, as thoroughly detailed in this conversation, marks a pivotal advance in understanding the mechanics—and limitations—of today’s LLMs. While transformers are mathematically optimal Bayesian updaters, true AGI will require not just scale, but fundamentally new approaches: continual/plastic learning and architectures that grasp causality rather than just correlation. The episode ends with both hope and humility for AI's future: the next frontier is causal, not merely statistical.
