Episode Overview
Podcast: Sean Carroll's Mindscape
Episode: 336 – Anil Ananthaswamy on the Mathematics of Neural Nets and AI
Date: November 24, 2025
Host: Sean Carroll
Guest: Anil Ananthaswamy, science writer and author of "Why Machines Learn: The Elegant Math Behind Modern AI"
Theme:
This episode dives deep into the mathematical foundations of artificial neural networks and their impact on contemporary AI, especially large language models (LLMs). Anil Ananthaswamy, science writer and former engineer, shares the story of his own intellectual journey into the math that underpins modern AI, framing it as both elegant and surprisingly classical. Their wide-ranging discussion explores the history (and myths) of neural-net research, core mathematical concepts, and the recent leaps enabled by architectures like the Transformer.
Key Discussion Points and Insights
1. Why Write About AI Mathematics? (04:35–10:08)
- Anil’s Motivation:
- Noticed rise of machine learning in science stories as early as 2016–17.
- Unlike with other complex fields he covers, he felt that, given his engineering background, he could “get his hands dirty” with machine learning directly ([05:52]).
- Undertook a personal project during an MIT Knight Science Journalism Fellowship: could a deep neural net do what Kepler did with planetary motion?
- Short answer: “absolutely not” (06:35). Neural nets are too data-hungry; Kepler worked from scant data with rich prior human knowledge ([10:41]).
- “Today's neural networks are extremely sample inefficient, so they require too much data to do what they need to do.” (11:28)
- COVID lockdown led him to self-study advanced math via online lectures, developing an appreciation for the elegance of the underlying mathematics.
- “I just wanted to share in the beauty of the math that I was encountering.” (09:27)
2. What Can AI Not Do (Yet)? – The Kepler Story (10:08–14:45)
- Even advanced neural networks or LLMs, when trained on Kepler’s original scant data, “have no way of spitting out some symbolic form of Kepler's laws.” (11:19)
- Human science, especially the kind done by Kepler or Einstein, involves conceptual leaps and prior frameworks that current AI struggles to replicate ([14:10]).
3. The Birth of Neural Networks & The Perceptron (15:00–22:01)
- Perceptron Era:
- Invented by Frank Rosenblatt (Cornell), late 1950s. Modeled as a single-layer artificial neural network: “a computational unit... does some sort of weighted sum of [inputs], adds a bias term, and then if that... exceeds some threshold, it will produce a one, otherwise... minus one.” (15:13)
- First proof-of-concept for learning linear classification: “the perceptron convergence proof... if the data is linearly separable, then the algorithm will find it in finite time.” (18:15)
- Memorable moment: Anil cites his own book structure inspired by Somerset Maugham—“if it weren’t for this proof, I would not have written the book.” (18:37)
- Visualizing High-Dimensional Learning:
- Example: 20x20-pixel images, each a point in a 400-dimensional space, with clusters representing digits like '9' or '4'. A “hyperplane” divides these clusters for classification. (19:11) A minimal perceptron sketch follows at the end of this section.
- “In the case of the perceptron, it will find a hyperplane—and so this will be a 399 dimensional plane that will separate out the two classes of data.” (20:09)
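The learning rule and the hyperplane picture above are compact enough to show directly. Below is a minimal Python/NumPy sketch (my own toy illustration, not code from the episode or the book): a 2-D linearly separable dataset stands in for the 400-dimensional digit example, and the classic perceptron update finds a separating line, i.e. a hyperplane in two dimensions.

```python
import numpy as np

# Toy, linearly separable 2-D data: label +1 if x + y > 1, else -1.
# Points too close to the boundary are dropped so a clear margin exists.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 2, size=(200, 2))
X = X[np.abs(X.sum(axis=1) - 1) > 0.2]
y = np.where(X.sum(axis=1) > 1, 1, -1)

w = np.zeros(2)   # weights
b = 0.0           # bias term

# Classic perceptron rule: only misclassified points move the hyperplane.
for epoch in range(1000):
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:   # point is on the wrong side
            w += yi * xi                     # nudge the hyperplane toward it
            b += yi
            mistakes += 1
    if mistakes == 0:                        # every point correctly classified
        break

print(f"converged after {epoch + 1} epochs: w = {w}, b = {b:.1f}")
```

Because the data are separable by construction, the convergence theorem guarantees the loop terminates; in the 400-dimensional digit example the same rule would instead find a 399-dimensional separating hyperplane.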
4. The Underappreciated Innovators: Bernie Widrow and the Path to Backpropagation (22:13–26:21)
- Bernie Widrow (Stanford) developed digital adaptive filters and realized these could be applied to build artificial neurons.
- Anil tells the story of Widrow and his student Ted Hoff: in a single weekend, they invent and build the first hardware artificial neuron—precursor to today’s neural nets ([25:45]).
- “That's definitely not what usually happens in grad school!” (Sean, [25:55])
5. Single vs Multilayer Networks: The XOR Problem and the First AI Winter (27:32–32:12)
- Thresholded Neurons & Biological Roots:
- The notion of a “threshold” is inspired by biological neurons (26:45).
- Limits of Early NNs:
- Adding more layers posed an algorithmic problem: there was no known way to train the weights of the hidden layers.
- Minsky & Papert’s “Perceptrons” book (1969) proved single-layer NNs couldn’t solve the XOR problem and “kind of underhandedly” suggested multi-layered ones couldn’t either—contributing to the first “AI Winter” (30:12).
6. Escaping Linear Limits: Hopfield Networks and the Dawn of Deep Learning (32:23–40:39)
- Hopfield Networks (1980s):
- Fully interconnected, recurrent networks storing associative memories.
- Deeply inspired by physics (Ising models, energy minimization). A toy associative-memory sketch follows at the end of this section.
- Sean notes: “Obviously they're stealing ideas from physics, but we made up for it by giving them the Nobel Prize.” (36:16)
- Backpropagation Breakthrough (1986):
- Key insight: swap step-function activations for differentiable sigmoids.
- Allows backpropagation—the key training algorithm for modern deep nets. “It's an extraordinarily simple idea in retrospect.” (36:24)
- Core is “just the chain rule of calculus”—the same principle applies whether you have 30 or a trillion parameters. (46:49, 50:40)
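To make the associative-memory idea concrete, here is a minimal Hopfield-style toy in Python/NumPy (my own sketch, not code discussed in the episode): one pattern is stored via a Hebbian outer-product rule, a corrupted copy is presented, and unit-by-unit sign updates let the network slide down its energy function back to the stored memory.

```python
import numpy as np

rng = np.random.default_rng(1)

# One stored pattern of +/-1 values: the "memory".
pattern = rng.choice([-1, 1], size=50)

# Hebbian outer-product weights, with no self-connections.
W = np.outer(pattern, pattern).astype(float)
np.fill_diagonal(W, 0.0)

def energy(s):
    """Hopfield energy; asynchronous updates never increase it."""
    return -0.5 * s @ W @ s

# Corrupt 10 of the 50 units, then let the network settle.
state = pattern.copy()
flipped = rng.choice(len(state), size=10, replace=False)
state[flipped] *= -1
print("energy of corrupted state:", energy(state))

for _ in range(3):                              # a few asynchronous sweeps
    for i in rng.permutation(len(state)):       # update one unit at a time
        state[i] = 1 if W[i] @ state >= 0 else -1

print("energy after settling:   ", energy(state))
print("recovered stored pattern:", np.array_equal(state, pattern))
```

With a single stored pattern the corrupted state snaps back almost immediately; the same construction stores multiple patterns, up to a classical capacity of roughly 0.14 patterns per unit.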
7. Modern Neural Nets: Feedforward vs Recurrent, and Training Mechanics (40:39–46:22)
- Feedforward Networks:
- Inputs flow in one direction; outputs do not affect earlier layers.
- Backpropagation is only used during training, not inference ([41:09]).
- Training as Error-Minimization:
- The network’s “loss function” defines a multi-dimensional error landscape; backprop adapts the weights to reduce the error, gradually “descending” towards a minimum. A worked two-layer example follows at the end of this section.
- “The back propagation part...you have to propagate that loss all the way back layer by layer so that you can update the weights of each layer as you go back.” (43:31–45:46)
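Here is a minimal worked version of the training loop just described (a from-scratch sketch under my own toy choices of architecture and data, not the episode's example): a two-layer sigmoid network is trained on XOR, the loss is propagated back layer by layer using nothing more than the chain rule, and gradient descent nudges every weight downhill.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR data: 4 examples, 2 inputs, 1 target each.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)   # input -> hidden (4 units)
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)   # hidden -> output
lr = 1.0                                          # learning rate

for step in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)          # hidden activations
    y = sigmoid(h @ W2 + b2)          # network output

    # Mean squared error and its gradient at the output.
    loss = np.mean((y - t) ** 2)
    dy = 2 * (y - t) / len(X)

    # Backward pass: chain rule, layer by layer.
    dz2 = dy * y * (1 - y)            # through the output sigmoid
    dW2, db2 = h.T @ dz2, dz2.sum(0)
    dh = dz2 @ W2.T                   # propagate the error into the hidden layer
    dz1 = dh * h * (1 - h)            # through the hidden sigmoids
    dW1, db1 = X.T @ dz1, dz1.sum(0)

    # Gradient-descent update.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final loss:", round(float(loss), 4))
print("predictions:", y.round(3).ravel())   # typically close to [0, 1, 1, 0]
```

If a run stalls short of [0, 1, 1, 0], that is the non-convex loss landscape at work, which is exactly the topic of the next section.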
8. The Math Landscape: Gradient Descent and High Dimensions (46:35–52:14)
- Gradient Descent:
- A classical idea—“something that Isaac Newton would have told us about many years ago.” (46:35)
- Modern deep nets operate in loss landscapes of “a trillion dimensions”—still, the underlying math is “very simple stuff in some fundamental sense.” (46:49)
- Nonlinearities:
- Due to the sigmoid nonlinearities, the landscape is not convex—“these loss landscapes are extraordinarily complex...lots and lots of very good optimal local minima.” (51:26) A one-dimensional toy run of gradient descent follows below.
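A one-dimensional toy run makes the “descending an error landscape” picture, and the role of local minima, concrete (the function below is my own choice, purely for illustration):

```python
# A simple non-convex "loss landscape" with two valleys of different depths.
def loss(w):
    return w**4 - 3 * w**2 + w

def grad(w):                      # derivative of the loss, d(loss)/dw
    return 4 * w**3 - 6 * w + 1

lr = 0.01                         # learning rate (step size)
for w0 in (2.0, -2.0):            # two different starting points
    w = w0
    for _ in range(500):
        w -= lr * grad(w)         # step downhill along the negative gradient
    print(f"start {w0:+.1f} -> settled at w = {w:+.3f}, loss = {loss(w):+.3f}")
```

Started from one side, the ball rolls into the shallower local valley; started from the other, it finds the deeper one. Real loss landscapes do the same thing in a trillion dimensions, where, as noted above, many of the local minima turn out to be good enough.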
9. The Curse—and Blessing—of Dimensionality (52:14–59:04)
- The Curse:
- In very high dimensions, “everything is as far away as everything else”—nearest neighbor and similarity-based algorithms break down. (52:51)
- Dimensionality reduction via Principal Component Analysis (PCA) is one solution ([57:09]).
- Quote: “If the data varied equally in all of the higher dimensional axes, then you're in trouble.” (57:08)
- The Blessing:
- Sometimes, projecting problems up to very high (even infinite) dimensions actually allows for easier linear separation—kernel methods enable efficient computation by mapping data mathematically rather than physically into higher dimensions. (59:04–62:34) A small numerical check of the kernel trick follows below.
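The “mathematically rather than physically” point can be checked numerically. In this small sketch (my own illustration; the degree-2 polynomial kernel is a standard textbook choice, not necessarily the one discussed in the episode), the kernel value computed in the original 2-D space exactly matches an ordinary dot product between explicit 6-D feature vectors, so an algorithm that only ever needs inner products never has to construct the high-dimensional vectors at all.

```python
import numpy as np

def poly_kernel(x, z):
    """Degree-2 polynomial kernel, computed entirely in the original 2-D space."""
    return (x @ z + 1.0) ** 2

def feature_map(x):
    """Explicit 6-D feature map whose dot products the kernel reproduces."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x = np.array([0.5, -1.2])
z = np.array([2.0, 0.3])

print(poly_kernel(x, z))                   # kernel value in low dimensions
print(feature_map(x) @ feature_map(z))     # same number, via explicit 6-D vectors
```

Because x1² and x2² appear as coordinates of the feature map, data separable only by a circle in 2-D become linearly separable in the 6-D space: the “blessing” side of dimensionality.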
10. The Transformer Revolution: Why "Attention is All You Need" (63:00–69:50)
- Transformers and Attention Mechanisms:
- Modern LLMs (e.g., ChatGPT, GPT-4) are built on transformers, deep networks that model the contextual relationships among all input words via “attention.” A minimal numerical sketch of this matrix math follows at the end of this section.
- “The attention mechanism is essentially the process that allows the transformers to contextualize these vectors. And it's a whole bunch of matrix manipulations. It's just very, very neat matrix math going on.” (66:36)
- During training, the network learns to minimize prediction error for masked next-word tasks, propagating these errors back through the weights of what can be a trillion-parameter model.
- Foundation of Scale and Surprising Power:
- Training these models on text scraped from the Internet, with minimal human input, shows the surprising strengths—and weaknesses—of scaling up language prediction ([70:33]).
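The “very neat matrix math” behind attention fits in a dozen lines. This is a minimal sketch of scaled dot-product attention (the token count, dimensions, and random projection matrices are my own toy stand-ins for learned parameters): each token’s query is compared with every token’s key, the scores become weights via a softmax, and each output row is the corresponding weighted mix of value vectors, i.e. a contextualized version of that token.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # how much each token attends to every other token
    weights = softmax(scores)       # each row sums to 1
    return weights @ V              # weighted mix of value vectors

rng = np.random.default_rng(0)
n_tokens, d_model = 4, 8            # a 4-token toy "sentence"
x = rng.normal(size=(n_tokens, d_model))

# Learned projection matrices (random stand-ins here) produce Q, K, V from the input.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = attention(x @ Wq, x @ Wk, x @ Wv)

print(out.shape)                    # (4, 8): one contextualized vector per token
```

A real transformer repeats this across many heads and layers and feeds the result through further learned projections, but the core contextualizing step is exactly this handful of matrix multiplications.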
11. AI’s Near Future – What Breakthroughs Still Remain? (69:50–73:35)
- Limits of Scaling:
- “The reason why scaling them up alone will not get us to any kind of generalized intelligence is potentially because a, we have no mathematical guarantee that a language model is 100% accurate. That you cannot guarantee accuracy.” (70:36)
- LLMs “are extremely sample inefficient... they require enormous amounts of data.”
- The Next Leap:
- “We’re probably one or two steps away like that from an AI that is capable of generalizing to questions that it hasn’t seen, answering things... patterns that don’t exist in the training data. So effectively going back to our early argument, doing what Kepler did, and LLMs are not those kinds of systems.” (73:01)
- “My hunch is that we're two or three breakthroughs away from something quite transformative.” (73:19)
- Ends on optimism for conceptual leaps to come: “That gives the youngsters in the audience something to think about and something to try to do.” (73:35)
Notable Quotes and Memorable Moments
- On Machine Learning’s Math:
“This math is actually quite lovely that there are stories here to be told.”
— Anil Ananthaswamy (08:10)
- On the Perceptron Proof:
“If it weren’t for this proof, I would not have written the book.”
— Anil Ananthaswamy (18:37)
- A Fun Origin Story:
“Over the course of the weekend, [Widrow and Hoff] build the world’s first hardware artificial neuron. Monday morning, they have it working, right?”
— Anil Ananthaswamy (25:45)
“That's definitely not what usually happens in grad school!”
— Sean Carroll (25:55)
- Transformers and Attention:
“The attention mechanism is essentially the process that allows the transformers to contextualize these vectors. And it's a whole bunch of matrix manipulations. It's just very, very neat matrix math going on.”
— Anil Ananthaswamy (66:36)
- The Unsolved Challenge:
“We’re probably one or two steps away like that from an AI that is capable of generalizing to questions that it hasn’t seen, answering things... patterns that don’t exist in the training data. So effectively going back to our early argument, doing what Kepler did, and LLMs are not those kinds of systems.”
— Anil Ananthaswamy (73:01)
Timestamps for Key Segments
- Anil’s background and book motivation: 04:35–10:08
- Kepler story and present AI limitations: 10:08–14:45
- Origins of neural networks, Perceptron, and linear separability: 15:00–22:01
- Widrow, Hoff and early hardware neural nets: 22:13–26:21
- AI winter and the XOR problem: 27:32–32:12
- Hopfield networks, energy minima, and comparison to modern feedforward/recurrent nets: 32:23–41:09
- Backpropagation breakthrough explained: 40:39–46:22
- Gradient descent, trillion+ dimensional landscapes: 46:35–52:14
- Curse/benefit of dimensionality, PCA, kernel methods: 52:14–62:34
- Transformers and attention explained: 63:00–69:50
- What’s next for AI and the need for new conceptual breakthroughs: 69:50–73:35
Final Thoughts
Sean and Anil’s conversation elucidates the often-overlooked foundations of modern AI: an interplay of simple (yet elegant) mathematical techniques, historical accidents, and engineering. Despite the scale and complexity of today’s neural nets, their behavior and limitations often reflect mathematical insights that are centuries old: the chain rule, gradient descent, linear algebra. The looming question—can machines ever “do what Kepler did”?—remains tantalizingly open, and both host and guest agree: for all the extraordinary advances, the greatest AI breakthroughs may be yet to come.
