Eye on A.I. – Episode #326: Zuzanna Stamirowska on Pathway’s Post-Transformer Architecture Designed for Memory and On-the-Fly Learning
Date: March 11, 2026
Host: Craig S. Smith
Guest: Zuzanna Stamirowska, CEO & Co-Founder of Pathway
Episode Overview
This episode features Zuzanna Stamirowska, co-founder and CEO of Pathway, discussing her team’s groundbreaking “post-transformer” neural architecture, codenamed Dragon Hatchling (BDH). The conversation focuses on the limitations of transformer architectures, the challenge of memory in AI, the principles underpinning BDH, and its implications for reasoning, continual learning, and future AI systems. The episode dives deep into network structure, emergent complexity, neuroscience parallels, and the path from research to real-world applications.
Key Discussion Points & Insights
1. Transformers, Memory and Their Limitations
- Transformer Constraints: Zuzanna likens large language models (LLMs) to “interns that never get better with experience,” as transformers lack enduring memory and adaptability ([02:23]).
  “It's a bit like having an intern that you hire on the first day... she may be brilliant, but she stays an intern forever on the first day of her job. Doesn't get more context, doesn't get better with experience.” – Zuzanna ([10:30])
- Current Capabilities: LLMs are restricted to “short-term” reasoning, executing tasks spanning only a few hours before internal context is lost ([10:30]).
- Urgency for Memory: The Pathway team views enhanced, structured memory as a core vector for the next wave of AI progress.
2. Background & Motivations
- Interdisciplinary Lens: Zuzanna’s foundation is in complexity science, focusing on emergent phenomena within networks: how global order emerges from local interactions over graphs ([04:53]).
  “A graph is a network... dots represent nodes and then you have edges that link nodes... it's an evolving system where things happen locally, not necessarily always planned.” ([04:53])
- Team Composition: The Pathway team includes experts in attention mechanisms from the pre-transformer era, quantum physics, and theoretical computer science ([04:53]).
3. Inside the Dragon Hatchling Architecture
- Post-Transformer Shift: BDH organizes memory in the synaptic edges (connections), using principles akin to Hebbian learning (“neurons that fire together, wire together”), enabling persistent, learnable memory during inference ([14:33]); see the sketch after this list.
  “We have neurons... linked with synapses. In a traditional transformer, everybody is connected to everybody... Here, not everybody is connected to everybody. We have a graph structure of relevant connections... trained with signal...” ([14:33])
- Sparse, Localized Computation: Unlike transformers’ all-to-all attention, BDH uses a sparse graph where activations and memory are shared only among locally connected nodes ([14:33]; [25:50]).
- Plasticity: Synaptic plasticity underpins learning: connections strengthen with use and fade without it, echoing biological neural networks ([41:23]).
- Architectural Efficiency: The structure allows for massive memory capacity without scaling up the number of neurons, relying instead on the richness of synaptic connections.
- Emergent Topologies: Over time, the graph tends toward scale-free network structures, offering resilience and efficient communication similar to natural networks ([35:47]).
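Since the episode describes the mechanism only at a high level, here is a minimal Python sketch of the general idea: memory lives in the weights of a sparse set of synaptic edges, an edge strengthens when the two neurons it connects fire together, and unused edges slowly decay. The function name, constants, and exact update rule are illustrative assumptions, not Pathway’s actual BDH equations.

```python
import numpy as np

def hebbian_step(edges, activations, lr=0.01, decay=0.001, threshold=0.5):
    """Illustrative sketch only (not Pathway's published rule): update sparse
    edge memory from co-activation of connected neurons at inference time.

    edges       -- dict mapping (i, j) neuron pairs to a synaptic weight;
                   only these pairs exist, so computation stays local and sparse
    activations -- 1-D array of current neuron activations
    """
    active = activations > threshold                    # threshold-like firing
    for (i, j), w in edges.items():
        if active[i] and active[j]:
            # "neurons that fire together, wire together"
            edges[(i, j)] = w + lr * activations[i] * activations[j]
        else:
            # unused connections slowly fade instead of being overwritten
            edges[(i, j)] = w * (1.0 - decay)
    return edges

# Toy usage: a 4-neuron graph with three "relevant connections"
edges = {(0, 1): 0.2, (1, 2): 0.2, (2, 3): 0.2}
acts = np.array([0.9, 0.8, 0.1, 0.0])
edges = hebbian_step(edges, acts)
print(edges)  # (0, 1) strengthened; the other edges decayed slightly
```

The point of the sketch is that writes happen only on edges that already exist in the graph, so memory is updated locally rather than through a dense all-to-all weight matrix.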
4. Comparisons & Analogies
- Difference from Mixture-of-Experts: BDH’s modules (experts) can be as fine-grained as a single connection and discover their roles organically, unlike the hand-engineered expert subgroups of mixture-of-experts models ([33:53]).
- Brain-Like Reasoning: Local activation and threshold-like mechanisms reflect biological neural principles, bridging neuroscience and artificial learning ([32:30]).
  “We are offering plausible brain-like explanation for how reasoning may appear.” – Zuzanna ([32:30])
- Graph vs. Layered Networks: Traditional neural nets have rigid layers; BDH is more flexible and organic, allowing new knowledge to be “packed” into the structure without overwriting previous skills ([28:36]; [41:23]).
5. Continual Learning, Catastrophic Forgetting, and Robustness
- Locality and Memory: With a vast but sparsely accessed state, BDH avoids catastrophic forgetting, enabling continual learning without erasing prior knowledge ([25:50], [41:23]); a sketch of this locality follows the list.
  “You may have huge state and use very little of it at every step because of the local dynamics... we found a way... to make it work on GPU and make them seem like dense matrices, but actually... preserving the sparsity...” ([25:50])
- Robust to Network Damage: The system’s function can redistribute after the loss of nodes or connections, similar to the resilience found in natural brain architectures ([41:23]).
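To make the locality argument concrete, here is a hypothetical sketch assuming a simple top-k gating rule (the episode does not specify the gating mechanism): with a very large state of which only a handful of units are touched per step, new updates leave the rest of the stored state, and hence earlier knowledge, exactly as it was. The GPU trick alluded to in the quote (making sparse state behave like dense matrices) is beyond this toy.

```python
import numpy as np

def sparse_update(state, drive, k=8, lr=0.05):
    """Hypothetical illustration of local dynamics: only the k most strongly
    driven units of a large state vector are read and written at each step,
    so the rest of the state (earlier knowledge) is left untouched."""
    active = np.argpartition(-np.abs(drive), k)[:k]   # indices of the k largest drives
    state[active] += lr * drive[active]               # touch only those entries
    return state, active

state = np.zeros(100_000)                             # a large persistent state
rng = np.random.default_rng(0)
for _ in range(100):                                  # 100 "inference steps"
    state, active = sparse_update(state, rng.normal(size=state.size))
print(np.count_nonzero(state))                        # at most 100 * 8 entries ever written
```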
6. Memory at Inference and Long-Horizon Reasoning
- On-the-fly Learning: Learning isn’t confined to training; connections strengthen at inference too, continually refining memory ([54:23]). A toy illustration of this persistent state follows the list.
  “All the dynamics that we talked about happen at inference... because you keep the memory, you can keep consistency... you have a structure to hold on to.” ([54:23])
- Reduction of Hallucination: Maintaining context over long periods should greatly decrease hallucinations compared to standard autoregressive transformers ([54:23]).
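As a toy illustration of what learning at inference implies operationally (this is not Pathway’s API; the class and its word-counting rule are invented for the example), the same state object persists and keeps being refined across calls instead of being rebuilt from a context window each time:

```python
from dataclasses import dataclass, field

@dataclass
class ToyMemoryModel:
    """Invented stand-in for a model with inference-time memory: the state
    persists across calls instead of being reset like a context window."""
    state: dict = field(default_factory=dict)

    def step(self, text: str) -> str:
        for word in text.split():
            # stand-in for strengthening connections as new input arrives
            self.state[word] = self.state.get(word, 0) + 1
        # the answer reflects everything seen so far, not just this call
        return max(self.state, key=self.state.get)

model = ToyMemoryModel()
for doc in ["cargo ships route", "ships route delay", "route delay forecast"]:
    print(model.step(doc))   # memory accumulates across documents
```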
7. Creativity, Emergence, and Generalization
- Potential for Creative Reasoning: Zuzanna expresses hope that structure, plasticity, and emergent representations will enable true innovation, “eureka moments,” and generalization over time—going beyond just remixing learned patterns ([46:28]).
"The real innovation wouldn't be just recomposing things that exist, but seeing the loophole... come up with eureka moments... this will go through having some sort of internal representation that allows for it." ([46:28])
8. Performance, Scaling, and Productization
- Performance: Early BDH models, with 1B parameters, match or outperform GPT-2-scale transformers on language tasks, often requiring less data ([37:39]).
- Scaling Laws: BDH follows favorable scaling laws, but the team is not chasing the endless growth of N (the number of neurons) that drives the transformer scaling race ([55:51]).
  “We're not playing the scaling game in the sense that we don't need to grow the N so much... the power won't be coming from the scale.” – Zuzanna ([55:51])
- Real-World Applications:
  - Infinite Context & Enhanced Reasoning: Suitable for tasks requiring long-term, dynamic memory, small but high-value datasets, continual learning, and explainable, personalized reasoning ([50:28]).
  - Enterprise and Regulated Domains: Early use cases include healthcare claims resolution and nuclear industry documentation ([50:28]).
  - Industry Collaboration: Partnership with AWS and Nvidia to bring the first models to cloud customers ([49:37]).
9. Model Merging and Compositionality
- Mergeability: BDH models, being graph-based, can be merged along the node/connection axis, akin to composable software modules; this could allow combining specialized models post-training ([63:26]).
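A loose sketch of the “gluing along dimension N” idea, under the simplifying assumption that a model is just a sparse edge-weight dictionary over numbered neurons: merging appends one model’s neurons after the other’s and takes the union of the edges. In a real merge the two subgraphs would share input/output interface dimensions, which is what would let the combined model mix behaviors (the “Creole” effect described in the quotes below); this toy omits that.

```python
def merge_models(edges_a, n_a, edges_b, n_b):
    """Loose sketch of 'gluing along dimension N' (not Pathway's actual method):
    keep model A's neurons and edges, append model B's neurons after them,
    and take the union of the two edge sets."""
    merged = dict(edges_a)                     # A's graph is kept unchanged
    for (i, j), w in edges_b.items():
        merged[(i + n_a, j + n_a)] = w         # B's neuron indices shifted past A's
    return merged, n_a + n_b

# Toy usage: two 3-neuron "models" become one 6-neuron model
edges_fr = {(0, 1): 0.7, (1, 2): 0.4}          # e.g. trained on one language
edges_en = {(0, 2): 0.9}                       # e.g. trained on another
merged, n = merge_models(edges_fr, 3, edges_en, 3)
print(n, merged)                               # 6 neurons; both edge sets preserved
```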
Notable Quotes & Memorable Moments
- Transformers & Memory:
  “They [transformers] work a bit like Groundhog Day. I mean, they wake up every day with their memory... wiped out.” – Zuzanna ([10:30])
- Organic Network Growth:
  “You have these loops, you know, in a way that... well, does the network structure what's happening, or is it the other way around? Kind of a chicken-and-egg problem.” – Zuzanna ([04:53])
- On Generalization & Creativity:
  “The most imminent generalization that we are after is generalization over time. The real innovation wouldn't be just recomposing things that exist, but pretty much seeing the loophole.” – Zuzanna ([46:28])
- Plasticity and Resilience:
  “Function shapes the network... sometimes swap the nerves and the function of them... the stimulus... shapes the network.” – Zuzanna ([41:23])
- BDH vs. Transformers:
  “We actually use it very locally... This is key. This is very different to everything else that happens in AI, as far as I know, because we are really focusing on the local dynamics...” ([25:50])
- On Network Merging:
  “...We glue them along this one dimension N... we actually on language we do... glue the two models trained on two different languages and then all of a sudden they start to speak some sort of like Creole...” ([63:26])
Timestamps for Key Segments
- Transformer memory limits & analogy – [10:30]
- Complexity science background – [04:53]
- How BDH stores memory in connections – [14:33]
- Sparse, local computation & synaptic plasticity – [25:50], [41:23]
- Brain analogy and spiking/rumor-spreading – [32:30], [28:36]
- Avoidance of catastrophic forgetting – [41:23]
- Hallucinations and inference-time learning – [54:23]
- Potential for creativity & emergence – [46:28]
- Industrial productization and AWS partnership – [49:37], [50:28]
- Model merging/composability – [63:26]
Closing & Further Resources
- Connect with Pathway and Zuzanna:
  - LinkedIn (Zuzanna)
  - researchpathway.com
  - [Dragon Hatchling paper on arXiv and GitHub] (see episode notes for links)
- Advice:
  “LLMs are doing a phenomenal job on this paper, especially the reasoning ones, once we ask them to properly read the paper... I would strongly encourage you to use AI to read about AI.” – Zuzanna ([65:54])
Summary prepared for listeners seeking a comprehensive understanding of episode #326 of Eye on A.I. with Zuzanna Stamirowska.
