Podcast Summary: Latent Space – Information Theory for Language Models with Jack Morris
Release Date: July 2, 2025
Host: Swyx (Latent.Space)
Guest: Jack Morris (PhD student, Cornell Tech; researcher in AI, NLP, and information theory)
Overview
This episode dives deep into the intersection of information theory and large language models (LLMs), with a focus on Jack Morris' groundbreaking research. The conversation explores the evolution of AI research in academia versus industry, foundational questions about model capacity, embedding inversion, and the role of data in AI breakthroughs. Jack shares candid insights from his own PhD journey, offers practical advice for aspiring researchers, and speculates on the next paradigm shifts in AI.
Key Discussion Points
Academic AI in the Era of Foundation Models (00:03–08:49)
- Background of Jack Morris: Jack is a Cornell Tech PhD student under Sasha Rush.
- Started in ML at a state university without a deep learning department (2017–2018).
- Google AI Residency during the pandemic, then entered grad school in 2021.
- Witnessed the major shift with ChatGPT’s public release (late 2022).
- Industry Shifts: The explosion of LLMs moved much "frontier" research into private companies.
- The Hardware Bottleneck: Academia lagged behind industry's compute; the field transitioned from BERT-sized models (~100M params) to GPT-scale models (7–8B+ params).
- Jack: "There was kind of two years where everyone in academia was working on smaller models and none of it really mattered." (07:46)
Advice for Grad Students and AI Engineers (08:49–14:31)
- Don't Just Build Benchmarks: Jack chose to work on models themselves, not only benchmarks, contrary to the common advice that academics should stick to evaluation work.
- Distributed Training Experience:
- "There's probably basically no grad students doing multi node training. I mean, there's probably a few, especially if they have like company affiliations. But that's really unusual, I think." (09:38)
- Practical Learning: Encouragement to join public Discords (GPU MODE, fast.ai) and engage directly with toolkit developers (e.g., the DeepSpeed team at Microsoft).
- Learning CUDA/Mojo:
- "It's definitely a great idea to learn CUDA if you can. ... If you do it, you've got to be one of the most hireable people in the world." (11:20)
- Mojo is possibly the new practical sweet spot for performance AI engineering (12:29–14:10).
Information Theory & Language Models: Foundations (15:17–23:14)
- V-Information and Usable Information: Jack discusses "A New Type of Information Theory" and the concept of V-Information from a 2020 paper.
- Shannon's theory measures information without accounting for computational extractability.
- "We should measure information with computational power as a constraint. ... Maybe the left text file actually has more extractable information." (15:50)
- Analogy: Just as the "bit" unified communication theory, we lack a comparable unit of measure for LLMs (a compact statement of the V-information definition follows this list).
- Kolmogorov Complexity and Compression: Direct connection between LLM training, model weights, and data compression (21:56); a toy illustration of the LM-as-compressor view also appears below.
- "I think we have a very good understanding of language model pre-training and there's a deep connection between language models and compression." (21:56)
Embedding Research and Practical Implications (23:14–31:09)
- Text Embeddings Reveal Almost as Much as Text:
- Motivation: Understanding what information is recoverable from vector embeddings, especially as vector databases gain popularity.
- "First, ... reverse engineer the text that's in embeddings." (23:38)
- In practice, Jack's team showed that embeddings hold a surprising amount of recoverable text (up to ~90% for certain lengths); a toy sketch of the inversion framing appears after this section's list.
- Privacy Implications:
- "If someone hacks into a vector database, what do they actually find?" (25:46)
- Research Journey: Jack details the rewarding—and at times frustrating—process of moving from incremental improvements to a major breakthrough in embedding inversion.
- Memorable moment: “I got it to 35 [accuracy]. ... Then we ended up getting the number to like 97 ... And that was like, so great.” (30:32)
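As referenced above, here is a toy sketch of the embedding-inversion framing. The published approach trains models to iteratively correct hypotheses rather than searching naively; the snippet below only shows the simplest black-box search version of the problem, with `embed` and `propose` as assumed, hypothetical interfaces:

```python
import numpy as np

def invert_embedding(target_emb, embed, propose, seed="", n_steps=100):
    """Toy hill-climbing sketch of embedding inversion (not the published method).

    `embed(text) -> np.ndarray` is an assumed black-box encoder and
    `propose(text) -> list[str]` is a hypothetical generator of small edits to
    the current guess (word swaps, insertions, deletions). We keep whichever
    candidate text re-embeds closest to the target vector.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    guess, best = seed, cosine(embed(seed), target_emb)
    for _ in range(n_steps):
        for candidate in propose(guess):
            score = cosine(embed(candidate), target_emb)
            if score > best:          # greedy: accept any improvement
                guess, best = candidate, score
    return guess, best
```

Even this crude loop makes the privacy point concrete: anyone holding the vectors plus query access to the encoder can begin reconstructing the underlying text.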
Universal Geometry of Embeddings & Model Alignment (31:20–50:38)
- Platonic Representation Hypothesis (MIT, 2024): Different models converge to similar representations as data and scale increase.
- "As the models get better ... they're sort of converging to learn the exact same thing." (32:38)
- Cross-Model Embedding Alignment with CycleGAN: Inspired by CycleGAN for images, Jack applied similar ideas to map embeddings between very different models (a minimal cycle-consistency sketch appears after this section's list).
- "When we do this cycle GAN in the embedding space, they just perfectly sort of snap to the same place, which is amazing and has some pretty deep implications…” (45:19)
- Practical Implication: Swappable, stackable model adapters — enables modular, efficient multimodal AI design (49:52).
- Limitations and Questions:
- How much information can you cram into an embedding?
- "If you embed an entire book to a 500 dimensional vector, ... there must be... collisions.” (35:48)
- Superposition and the physical nature of model representations remain open research questions.
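The cycle-consistency idea behind the cross-model alignment discussed above can be sketched in a few lines. This illustrates the shape of the cycle term only (the CycleGAN recipe it borrows from also includes adversarial losses and more careful architecture), assuming you already have unpaired embedding batches from two different models:

```python
import torch
import torch.nn as nn

d_a, d_b = 768, 1024                       # hypothetical embedding dimensions
f = nn.Linear(d_a, d_b)                    # translator: space A -> space B
g = nn.Linear(d_b, d_a)                    # translator: space B -> space A
opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

def cycle_step(emb_a, emb_b):
    """One step of a toy cycle-consistency objective.

    Embeddings should survive the round trips A -> B -> A and B -> A -> B even
    though we never observe paired (a, b) examples, mirroring the CycleGAN idea.
    """
    loss = ((g(f(emb_a)) - emb_a) ** 2).mean() + ((f(g(emb_b)) - emb_b) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# usage with random stand-ins for real embedding batches
cycle_step(torch.randn(32, d_a), torch.randn(32, d_b))
```

If the Platonic Representation Hypothesis is right, the two spaces share enough structure that even simple translators like these can "snap" them into alignment.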
Model Capacity, Memorization, and Information Limits (52:00–58:52)
- What is Stored in the Weights? (a back-of-the-envelope reading of the estimate appears after this list)
- Paper: "How much can language models memorize?"
- "Transformers that are trained in 32 bit precision, we approximate, can store about 3.6 bits of information ... per parameter." (55:58)
- Real Limits?
- Is this an optimization artifact or a true architectural bound?
- "If you have 32 bits available and you can only use three to four of them... you could build your own AI lab if you can make these models that much more efficient." (56:14)
- Memorization vs. Generalization:
- "The best memorizer model may not be the best generalizer model... You just get the best actual compressor. You're just going to get gzip." (59:16)
Recovering Training Data from Model Weights (60:16–66:00)
- Emerging Research: Can you reconstruct training data from model weights or checkpoints?
- Jack's group uses the difference between two checkpoints (pre- and post-finetune) to gradient-select likely training samples from large candidate datasets (a schematic of this selection step is sketched after this section).
- "So the way we put this, you have this kind of like difference in parameter space telling you what deep seq fine tuned on. ... But we have no tool for like interpreting or kind of decrypting this weight difference." (61:16)
- Practical Impacts: Provocative in the context of open-weight releases (e.g., Gemma, DeepSeek, Llama).
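As noted above, a schematic of the selection step: gradient descent moves weights along the negative loss gradient, so candidate examples whose gradients are anti-aligned with the finetuning weight delta are plausible members of the finetuning set. The sketch below assumes a hypothetical `per_example_gradient` helper and flattened parameter tensors; it illustrates the core signal, not the full published pipeline:

```python
import torch

def rank_candidates(delta_w, candidates, per_example_gradient):
    """Rank candidate training examples against a finetuning weight diff.

    `delta_w` is post-finetune weights minus pre-finetune weights, flattened
    into one 1-D tensor; `per_example_gradient(example)` is an assumed helper
    returning the flattened loss gradient of the pre-finetune model on that
    example. Higher alignment of -gradient with delta_w suggests the example
    (or something like it) was in the finetuning data.
    """
    scored = []
    for example in candidates:
        g = per_example_gradient(example)
        score = torch.nn.functional.cosine_similarity(-g, delta_w, dim=0).item()
        scored.append((score, example))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```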
"There Are No New Ideas in AI, Only New Datasets" (66:07–74:29)
- Jack’s Thesis (Blog Post):
- "There are no new ideas in AI, only new data sets."
- Cites Kuhn's The Structure of Scientific Revolutions: AI progress occurs through rare paradigm shifts, each triggered by new data.
- Four Paradigm Shifts in Modern AI:
- Emergence of deep neural networks via ImageNet
- Transformers and web-scale pretraining (BERT, GPT)
- Instruction tuning via human feedback data
- Reasoning via verified symbolic data (math solvers, code, etc.)
- "None of the paradigm shifts in AI were really just about new methods. Data was decisive." (Summary of 67:27–72:18)
- Open Question: What comes after reasoning? The field awaits the next fundamental data shift.
- "Predicting the future is too damn hard... maybe it'll be obvious to me in hindsight." (74:11)
Notable Quotes & Memorable Moments
"There was kind of two years where everyone in academia was working on smaller models and none of it really mattered."
— Jack Morris (07:46)
"We should measure information with computational power as a constraint... [V-information] measures how much information is extractable from a given file or code."
— Jack Morris (15:50)
"It was so rewarding ... We had this number that was like 30 for months. ... Then we ended up getting the number to like 97."
— Jack Morris, on his embedding inversion breakthrough (30:32)
"As the models get better by scaling data and scaling model size, they're sort of converging to learn the exact same thing."
— Jack Morris, on the Platonic Representation Hypothesis (32:38)
"Transformers that are trained in 32 bit precision, we approximate, can store about 3, 6 bits of information to maybe 3.9 bits somewhere in there per parameter."
— Jack Morris (55:58)
"The best memorizer model may not be the best generalizer model... You're just going to get gzip."
— Swyx (59:16)
"There are no new ideas in AI, only new data sets."
— Jack Morris (67:03, thesis of the episode)
Timestamps for Important Segments
- 00:03 – Jack’s background and the shifting landscape of AI research.
- 07:46 – The impact of scaling and compute in academia.
- 11:20 – On specialization in CUDA/GPU versus broader model knowledge.
- 15:50 – V-information and a new view of information theory for LLMs.
- 23:38 – Embedding inversion: “reverse engineering the text that’s in embeddings.”
- 30:32 – Achieving a breakthrough in embedding inversion accuracy.
- 45:19 – Embedding alignment and the Platonic Representation Hypothesis.
- 52:00 – Discussion on language model capacity and information limits.
- 61:16 – Approximating training data from weight checkpoints.
- 67:03 – The “no new ideas, only data” hot take.
- 74:11 – Reflections on predicting AI’s next paradigm shift.
Closing & Where to Find Jack
- Contact & Research:
- Jack is active on Substack, Twitter, and open to collaborations on information-theoretic problems in ML and LLMs.
- "If you want to work on anything within that space or that's ... adjacent to the problems that we discussed in terms of ... model weight and activation information, that's very interesting to me and I would love to talk." (77:17)
- For more papers, visuals, and updates: Check the show notes on latent.space.
This episode is a deep exploration, blending hard research with practical implications and a field-wide perspective. Jack's work on information theory for language models foreshadows potential future breakthroughs in both model understanding and AI system design.
