Summary6 min read

Eye on A.I. – Episode #326: Zuzanna Stamirowska on Pathway’s Post-Transformer Architecture Designed for Memory and On-the-Fly Learning

Date: March 11, 2026
Host: Craig S. Smith
Guest: Zuzanna Stamirowska, CEO & Co-Founder of Pathway

Episode Overview

This episode features Zuzanna Stamirowska, co-founder and CEO of Pathway, discussing her team’s groundbreaking “post-transformer” neural architecture, codenamed Dragon Hatchling (BDH). The conversation focuses on the limitations of transformer architectures, the challenge of memory in AI, the principles underpinning BDH, and its implications for reasoning, continual learning, and future AI systems. The episode dives deep into network structure, emergent complexity, neuroscience parallels, and the path from research to real-world applications.

Key Discussion Points & Insights

1. Transformers, Memory and Their Limitations

Transformer Constraints: Zuzanna likens large language models (LLMs) to “interns that never get better with experience,” as transformers lack enduring memory and adaptability ([02:23]).

"It's a bit like having an intern that you hire on the first day... she may be brilliant, but she stays an intern forever on the first day of her job. Doesn't get more context, doesn't get better with experience." – Zuzanna ([10:30])
Current Capabilities: LLMs are restricted to “short-term” reasoning, executing tasks spanning only a few hours before internal context is lost ([10:30]).
Urgency for Memory: The Pathway team views enhanced, structured memory as a core vector for the next wave of AI progress.

2. Background & Motivations

Interdisciplinary Lens: Zuzanna’s foundation is in complexity science, focusing on emergent phenomena within networks—studying how global order emerges from local interactions over graphs ([04:53]).

"A graph is a network... dots represent nodes and then you have edges that link nodes... it's an evolving system where things happen locally, not necessarily always planned." ([04:53])
Team Composition: The Pathway team includes experts in attention mechanisms from the pre-transformer era, quantum physics, and theoretical computer science ([04:53]).

3. Inside the Dragon Hatchling Architecture

Post-Transformer Shift: BDH organizes memory in the synaptic edges (connections), using principles akin to Hebbian learning (“neurons that fire together, wire together”), enabling persistent, learnable memory during inference ([14:33]).

"We have neurons... linked with synapses. In a traditional transformer, everybody is connected to everybody... Here, not everybody is connected to everybody. We have a graph structure of relevant connections... trained with signal..." ([14:33])
Sparse, Localized Computation: Unlike transformers’ all-to-all attention, BDH utilizes a sparse graph where activations and memory are shared only among locally connected nodes ([14:33]; [25:50]).
Plasticity: Synaptic plasticity underpins learning—connections strengthen with use and fade without, echoing biological neural networks ([41:23]).
Architectural Efficiency: The structure allows for massive memory capacity without the need to scale up the number of neurons, focusing instead on the richness of synaptic connections.
Emergent Topologies: Over time, the graph tends toward scale-free network structures, offering resilience and efficient communication similar to natural networks ([35:47]).

4. Comparisons & Analogies

Difference from Mixture-of-Experts: BDH’s modules (experts) can be as fine as a single connection, discovering their roles organically, unlike hand-engineered expert subgroups in mixtures ([33:53]).
Brain-Like Reasoning: Local activation and threshold-like mechanisms reflect biological neural principles, bridging neuroscience and artificial learning ([32:30]).

"We are offering plausible brain-like explanation for how reasoning may appear." – Zuzanna ([32:30])
Graph vs. Layered Networks: Traditional neural nets have rigid layers; BDH is more flexible and organic, allowing new knowledge to be “packed” in the structure without overwriting previous skills ([28:36]; [41:23]).

5. Continual Learning, Catastrophic Forgetting, and Robustness

Locality and Memory: With vast, sparsely-accessed state, BDH avoids catastrophic forgetting, enabling continual learning without erasure of prior knowledge ([25:50], [41:23]).

"You may have huge state and use very little of it at every step because of the local dynamics... we found a way... to make it work on GPU and make them seem like dense matrices, but actually... preserving the sparsity..." ([25:50])
Robust to Network Damage: The system’s function can redistribute after loss of nodes/connections, similar to resilience found in natural brain architectures ([41:23]).

6. Memory at Inference and Long-Horizon Reasoning

On-the-fly Learning: Learning isn’t just during training—connections strengthen at inference, continually refining memory ([54:23]).

"All the dynamics that we talked about happen at inference... because you keep the memory, you can keep consistency... you have a structure to hold on to." ([54:23])
Reduction of Hallucination: Maintaining context over long periods should greatly decrease hallucinations compared to standard autoregressive transformers ([54:23]).

7. Creativity, Emergence, and Generalization

Potential for Creative Reasoning: Zuzanna expresses hope that structure, plasticity, and emergent representations will enable true innovation, “eureka moments,” and generalization over time—going beyond just remixing learned patterns ([46:28]).

"The real innovation wouldn't be just recomposing things that exist, but seeing the loophole... come up with eureka moments... this will go through having some sort of internal representation that allows for it." ([46:28])

8. Performance, Scaling, and Productization

Performance: Early BDH models, with 1B parameters, match or outperform transformers (e.g., GPT-2 scale) on language tasks, often requiring less data ([37:39]).
Scaling Laws: BDH adheres to favorable scaling, but does not “parallel the endless scaling of N” (number of neurons) seen in transformer races ([55:51]).

"We're not playing the scaling game in the sense that we don't need to grow the N so much... the power won't be coming from the scale." – Zuzanna ([55:51])
Real-World Applications:
- Infinite Context & Enhanced Reasoning: Suitable for tasks requiring long-term, dynamic memory, small but high-value datasets, continual learning, and explainable, personalized reasoning ([50:28]).
- Enterprise and Regulated Domains: Early use cases include healthcare claims resolution and nuclear industry documentation ([50:28]).
- Industry Collaboration: Partnership with AWS and Nvidia to bring the first models to cloud customers ([49:37]).

9. Model Merging and Compositionality

Mergeability: BDH models, being graph-based, can be merged along the node/connection axis, akin to composable software modules; this could allow combining specialized models post-training ([63:26]).

Notable Quotes & Memorable Moments

Transformers & Memory:

"They [transformers] work a bit like a Groundhog Day. I mean, they wake up every day with their memory... wiped out." – Zuzanna ([10:30])
Organic Network Growth:

“You have this loops, you know, in a way that—well, does the network structure what's happening or is it, is it the other way around? Kind of chicken and neck problem.” – Zuzanna ([04:53])
On Generalization & Creativity:

"The most imminent generalization that we are after is generalization over time. The real innovation wouldn't be just recomposing things that exist, but pretty much seeing the loophole." – Zuzanna ([46:28])
Plasticity and Resilience:

"Function shapes the network... sometimes swap the nerves and the function of them—the stimulus... shapes the network." – Zuzanna ([41:23])
BDH vs. Transformers:

"We actually use it very locally... This is key. This is very different to everything else that happens in the AIs, far as I know, because we are really focusing on the local dynamics..." ([25:50])
On Network Merging:

"...We glue them along this one dimension N... we actually on language we do... glue the two models trained into two different languages and then all of a sudden they start to speak some sort of like Creole..." ([63:26])

Timestamps for Key Segments

Transformer memory limits & analogy – [10:30]
Complexity science background – [04:53]
How BDH stores memory in connections – [14:33]
Sparse, local computation & synaptic plasticity – [25:50], [41:23]
Brain analogy and spiking/rumor-spreading – [32:30], [28:36]
Avoidance of catastrophic forgetting – [41:23]
Hallucinations and inference-time learning – [54:23]
Possible for creativity & emergence – [46:28]
Industrial productization and AWS partnership – [49:37], [50:28]
Model merging/composability – [63:26]

Closing & Further Resources

Connect with Pathway and Zuzanna:
- LinkedIn (Zuzanna)
- researchpathway.com
- [Dragon Hatchling paper on arXiv and GitHub] (see episode notes for links)
Advice:

"LLMs are doing a phenomenal job of this paper especially, especially the reasoning ones, once we ask them to properly read the paper... I would strongly encourage you to use AI to read about AI." – Zuzanna ([65:54])

Summary prepared for listeners seeking a comprehensive understanding of episode #326 of Eye on A.I. with Zuzanna Stamirowska.

Loading summary

Transcript59 lines

[00:00]
A
So it's a bit like having an intern that you hire on the first day of her job and she may be brilliant, but she stays an intern forever on the first day of her job. Doesn't get more context, doesn't get better with experience, doesn't get better over time.
[00:13]
B
Do you think there will be creativity in this kind of a network will be able to think beyond the fixed parameters of an LLM and come up with new ideas?
[00:25]
C
This episode is brought to you by Tastytrade on ionai. We talk a lot about how artificial artificial intelligence is changing how people analyze information, spot patterns and make more informed decisions. Markets are no different. The edge increasingly comes from having the right tools, the right data, and the ability to understand risk clearly. That's one of the reasons I like what Tastytrade is building. With Tastytrade you can trade stocks, options, futures and crypto all in one platform with low commissions including zero commissions on stocks and crypto so you keep more
[01:10]
B
of what you earn.
[01:12]
C
The platform is packed with advanced charting tools, back testing, strategy selection and risk analysis tools that help you think in probabilities rather than guesses. They've also introduced an AI powered search feature that can help you discover symbols aligned with your interests, which is a smart way to explore markets more intentionally. For active traders, there are tools like Active Trader Mode, One Click Trading and Smart Order Tracking. And if you're still learning, tastytrade offers dozens of free educational courses plus live support from their trade desk reps during trading hours. If you're serious about trading in a world increasingly shaped by technology, check out tastytrade. Visit tastytrade.com to start your trading journey today. I'm going to myself. TastyTrade Inc. Is a registered broker, dealer and member of FINLA, NFA and SIPC.
[02:24]
B
I mean, I've done some reading. I'm not sure I entirely understand everything yet, but I've been interested in self improving AI along with everybody else and continual learning and those things. Post Transformer Architectures I've spoken to a few people Carl Friston on I don't know if you followed his work, but he's working with a startup called Versus and then there's a company called Manifest AI in New York City. I don't know if they're at all related to what you're doing, but in any case I'm interested. And you guys are looking at this specifically for robotics applications or.
[03:16]
A
No, not necessarily. So first of all Craig, thank you so much for having me. It's a pleasure to be here. You also had a chance to talk to wonderful guests. We actually do follow the work of the guys at Manifest AI. It was actually great to hear them on your podcast as well. We don't necessarily focus on robotics. In fact what we focus on and the part of maybe some sort of obstacles that we see for AI that are pretty big, that we are lifting and hoping to lift are linked to memory. Not just continual learning, not just self adapted systems, but memory and then memory leading to enhanced reasoning. It's slightly more than just allowing for a larger state or, or the capacity of learning over time. It's part of it. But memory can also power reasoning and having, let's say memory organized in a different way. We can get the representation. This is something that folks call that internal representation that may potentially be more interesting and easier for reasoning.
[04:40]
B
Can you give us a little your background before we start talking about baby dragon hatchling? Pretty, pretty creative name, of course.
[04:54]
A
So I would love to actually, I would love to also maybe set the stage a little bit. Yeah, why we're looking at those, those problems before jumping into, you know, what we've done. Because I, I think the most, maybe the most interesting thing would be to, to establish some kind of common ground for, for what we see is happening. And I feel like many people across a number of labs by now also also are seeing and some of your guests as well. So as for my background, I'm actually complexity scientist, so I worked at the Institute of Complex Systems of Paris. I am specialized in studying emergent phenomena. So seeing how some sort of global order emerges from small local interactions and how locality matters in appearance of phenomena. So and this may happen on different sorts of topologies. I mean, in complexity science we usually look at graphs. So graphs is just a graph is a network, right? You have like dots which represent the nodes and then you have edges that link to nodes. This can be your road network in a city, right? This can be your Internet, this can be your social network, this can be your network connecting neurons and something happens on those links, right? There is some sort of, I know, cars, travel, mail is being sent, right? Water travels through pipes, whatever, but something, something happens. I mean, sometimes there is some operation being done on the link. There's some sort of function that then also drives the evolution of this network. So it's a system that may be, you know, evolving following the local interactions that happen on the network. So I know for roads we would look at, okay, somebody built a road in one place, but maybe they built A road between two places that, where two types of objects were produced to exchange. Right. But then there was a settlement somewhere. So okay, then we built a road again. Since there was a road, people started producing things because it was easier to trade with that location. So you sort of start to have this loops, you know, in a way that well, does the network structure what's happening or is it, is it the other way around kind of chicken and neck problem. Thing is it's an evolving system where things happen locally are not necessarily always planned by you know, a super, super designer at the global level that says what should be happening, which function should be done where you rather have this organic emergence. And this is, this is what I actually focused on early on in my career before that I worked and actually what led me to it was game theory. And game theory played on graphs. So you can have players, but players play on not like one to one or many too many, but they play on, on potentially complex topology and not, not against everybody at every step. But I also was trained at the French school, at the school for French politicians in my past. I also got a chance to teach macroeconomics for example. Yes. So yeah, this is the, I think a very, very useful type of education as I kind of see it right now, being a CEO of a startup. And then you know, when you, when you look at AI and many implications that it, it has and, and probably trying to project where we can get with AI as a society and also as, as countries and how this may impact maybe even the concepts of how states collaborate, etc. I mean perhaps at least I have some anchoring to start to think about this. So this is my background and I have a chance right now at Pathway. I mean I'm a co founder and also a co author of the BDH architecture. I have a privilege of leading a team of fantastic researchers in AI with sometimes various backgrounds. So our cto, Jan Khorovsky was the first person to apply attention to speech recognition. So that was in the pre transformer era and he was at Google Brain. Our cso, Adrian Kossovsky is a quantum physicist and a theoretical computer scientist at the same time and he has his PhD at 20. And you know there's, there's an entire team built of algorithmicians, people with background from physics as well. And I'm, I'm just, I'm just you know, extremely actually proud to be, to be able to meet them.
[10:06]
B
And what were you doing? I mean, you know you wrote this paper on Dragon hatchlings. I'm Actually looking. I don't remember when that was, but. And it was, that was in September. Was that before or after you founded Pathway?
[10:30]
A
No, we found this pathway way before. So we've been working on this, on the problem of memory and intelligence for, for a long time. And this is just, let's say the first glimpse into, into what we're doing internally. So if you look at LLMs, right, and this is, this is something that I think is becoming more, more and more public, I'd say, or maybe more people are becoming aware of this and, and in general public, like all the elements are pretty much based on the transformer architecture. That was a huge unlock for AI when that enabled like all of us pretty much to have the imagination of what could be possible with AI. The transformers are probably not, and I mean as Papua would argue, are not the last architecture in this entire AI market shift. And specifically from what we're seeing, and this is something that we kind of saw early on, they're missing memory. So they work a bit like a Groundhog Day. I mean, they wake up every day with their memory of their interactions of whatever problems they were solving being completely wiped out. So it's a bit like having having, you know, an intern that you hire on the first day of her job and she may be brilliant, but she stays an intern forever on the first day of her job. So doesn't get more context, doesn't get better with experience, doesn't get better over time. And then also isn't capable of staying focused on a task for a very long time without, without falling into hallucinations. And then of course, I mean with capacity of saying, let's say focus and coherent on solving a task, the complexity and the value of tasks that AI can accomplish will be, will be growing. And there's a lab that is called meter that is measuring like the length of human tasks that AI can do right now successfully. And we are at a couple of hours right now. And of course there are so many tasks that would in reasoning, for example, will take it way long, will take us way longer potentially. Also with changing inputs, data coming from the environment, right. Not all the problems are closed form kind of math problems. Some are open, highly dependent on context, some may depend also on small data. So yeah, so this memory, but I mean we see it also from the market perspective as let's say one of the vectors of the A market shift. And there are people who of course work on the topic of memory from the hardware side and we are working on this from the algorithmic side just
[13:44]
B
to back up a little bit and pull out to a wider view. So you're working, I mean these are your language models are neural networks. But the, the difference is instead of the information being coded, and I'm guessing now, or from my understanding being coded in the weights of the nodes, the, it's the edges, the, the connections between nodes that strengthen over time and in effect create memory. Am I getting that at all? Right.
[14:34]
A
It's a bit more, in a sense, it's a bit more because of the learning mechanism that we have. And as we started this discussion with talking about locality, the concept of locality and how the system works is very key. So we have a graph structure, a graph kind of structure of connections between neurons. This is true. So maybe what we're doing right now, we're actually discussing the architecture, the post transformer architecture that we published in the paper in September, which works a bit like the brain on gpu. So it follows the rule of Hebbian learning. This is a simple brain like model and we managed to get it to GPU and it actually trains even better than transformer when we compare it to the GPT2. So this is the transformer architecture. We compare architectures to architectures not yet models to models. Right. And at 1 billion scale and has a number of great properties. We designed it in fact to have memory and then to align it with the way we see reasoning, let's say unfolding so we can discuss it a bit later on. So to give you maybe an image of how in this system learning works, we have neurons, a bit like in the brain, right? And then the neurons are linked with synapses. I mean you can imagine wires or you can imagine roads. Right. In a traditional transformer, everybody is connected to everybody. So imagine a conference call where everybody is connected to everybody. And if you want to double the number of participants, you have to quadruple complexity of everything that's happening. Well, here not everybody is connected to everybody. We have a graph structure of relevant connections that are trained actually with signal that comes into the system. Whenever a signal comes to the system, the neuron which was, let's say triggered, sends over a message to its neighbors over those wires. And they're not connected to everybody. They're connected, let's say just, just to their friends, like you on Facebook, for example, right? They, they send over their message, the message just to their friends according to some rule of a threshold, the friend is interested or not so fires up potentially receiving, receiving the message. And then in the following step this disconnection between them becomes stronger because it's been proven to be relevant and it helps to create certain shortcuts or kind of wider roads. Okay. Like fast, high speed roads between certain, certain places in the network.
[17:50]
B
Yeah.
[17:52]
A
And in that way during learning you get optimization of the space. I'm not saying this very formally right now, so please don't hold me to it, but it's somehow more optimized space of connecting things. And we find that this structure optimizes itself naturally during training for communicability. And this is in fact very often a feature of complex systems that are, that are grown organically. You need to. The systems naturally like to balance communicability and let's say the cost of maintaining the links. So it should be usually, you know, easy for you and quick to reach everybody who's important on the network quickly. And then, you know, in this structure like the links can appear according to. We don't control the rules according to which they appear. So it may be the information, you know, you were at the cafe with a friend and that friend told you something very interesting about neuroscience. That was my case kind of actually when I started this research. And, and you will have a connection between the taste of that coffee and you know, something that happened in the brain of a mice of a mouse. And you may have this connection. It's not a very formal one, it's the one that you kind of have. And then it may help you to do some sort of informal thinking and just kind of exploring this space of, I don't know, related concepts according to whatever types of links that you could have found while learning.
[19:39]
B
Let me just ask some dumb questions to try and triangulate where this sits in the graph of my knowledge. The, the. Because I've, I've been, I had a conversation recently with David Ha at Sakana AI and they're talking about, you know, evolutionary systems that, that can grow new neurons in effect to add knowledge and you know, avoid the catastrophic forgetting of fixed LLMs in your Baby Dragon hatchling system. Two questions is that what happens as you learn new things? Do you, does the system add new nodes or is it all in. In the strength of the connections in the. Along the edges? And, and how does, how does that strengthening or weakening and mathematically is there. Because in a standard LLM there's a function in the node that encodes knowledge and could be wiped out with retraining. But is there a function in the edge then rather than just.
[21:27]
A
This is a very good question. This is one that our CSO likes very much. And I'd say it's mostly coming through him. So we have something, and I mean, this will be perhaps like pretty technical, but an important distinction for intuitions when one is thinking about such systems. And I think it's not very often the case when we talk about LLMs as a duality between operators and the state, the intuitions, the way it was actually, apparently, I mean, I'm just repeating after our cso because this is his part. This in fact comes from quantum physics just, I mean, historically in science, how, how it's being developed. And in this case, the state sits on the edge and on, on, on the fast weight, as we call it. And then we have the equivalent of it, which is a slow weight, which is a parameter. And this is kind of key. And what happens are that neurons are computations, but really the state sits on the edges, on the fast edges, mostly. So this is how we look at this. And what happens is that while with every step you actually also update. You use the network topology and weights for message passing, but then you also, after you've done it, you update your fast weights and you update the slow weights, let's say, at the slower frequency. And this is. Yeah, this is exactly.
[23:19]
B
And if there is no connection between two nodes, but as you said, you know, maybe there's a node for a coffee shop, maybe there's a node for, for. I mean, this is obviously not a realistic analogy, but maybe there's a node for you and for the person that you're talking to, but the, the person you're talking to is connected to the coffee shop and you're connected to, to that person. How does the system know to build a connection, an edge or whatever?
[24:00]
A
Yeah, this is, this actually very, this is a very good question. So I mean, we've been looking at it from, from both, both perspectives. Important to know that concepts, let's say if you think about concepts, concepts sit more on the edges than on the notes. So, so, so we're looking more at the synapses where, where the concepts are, are, are, are sitting. And then we would say that, okay, actually if a concept is important, they will have one node responsible for that concept, one note, one synapse actually responsible for that concept. And this is what we find, what we demonstrate with one, like with an example in the paper, we have the concept of currency and we find this synapse firing up. So, yeah, sorry, can you repeat the question?
[24:47]
B
Yeah, I mean, how do you build a connection between two nodes? If there's no connection.
[24:53]
A
Yeah, exactly. So the nodes, we, in practice, we usually keep the N as fixed. We keep the N as fixed. And there is an initialization rule at the beginning. So this is usually when you start growing network, you need to kind of start with, have an initialization rule for the very kind of first basic structure to appear that then starts to iteratively kind of grow and enhance. So there is a, I mean, by now I think we know, we tested a great number of them, but there's like a simple and simulation rule which is linked to, okay, how many attempts you have at creating your first links at the very beginning.
[25:37]
B
And then as the network, how does the network grow? I mean, this is what I was talking to David Ha about. How do you. Or, or it don't. Doesn't it grow? And the network.
[25:51]
A
So our network, you may imagine it doesn't grow so much. I mean the N, the, the, the number of. The size of network defined as the number of neurons in the brain and in practice for us stays fixed. But since it's a graph, right, you can pack a lot, a lot of synapses between them and then you can also. And so in the brain we're looking at hundreds of trillions of synapses for billions of neurons. So this gives us a very, very, very large state that we can use at, access and use efficiently at any time. Because we don't always need to use the entire state, we actually use it very locally. Thanks to the local dynamics. We may have huge state and use very little of it at every step because of the local interactions like the local dynamics. This is key. This is very different to everything else that happens in the AIs, far as I know, because we are really focusing on the local dynamics and we found a way to kind of mash this, apply some makeup and make it work. This skin has sparse things to make them work on GPU and make them seem like dense matrices, but actually being kind of preserving mathematically the sparsity of interactions that we have. And I'm happy to jump into this. So yeah, so point is local interactions and if you do it, you effectively have the context size, which is the size of your model. But you don't use always all of that. You use just the, the tiny bits that are, that are necessary for you. And then in terms of catastrophic forgetting. Well, one of the arguments is that once you learn new skills, you kind of have so much space to pack things that you don't impact, you don't necessarily impact the ones that were linked to your previous task. I mean, it's too early for me, you know, to talk more about like the solving the curse of catastrophic forgetting and continue learning. But, but let's say that yeah, we see this dimensions being very, very helpful.
[28:10]
B
And the art. The structure of the, of the. You know, a traditional neural network is, is in a very. It's very structured. There are layers and there's the width of each layer. This is much more organic. Is that right? It's not in layers, it's in a graph. So there's.
[28:37]
A
Yeah, exactly. So it's in the, it's the, in the ideal scenario. Right. And the pure thing to really, to build your intuitions and to really explain how and why it works. It's exactly a graph with local dynamics that work like a rumor spreading on networks for technical listeners. I mean, I would invite you to maybe have a look at John Kleinberg's and Avatar Dosh's papers about rumor spreading. So for example, like why certain neurons fire up. I mean the rule is linked to thresholding. So you need to reach a certain threshold, in fact of. Of relevance of information to activate the neighbor, et cetera. But it's really like spreading a rumor on the Internet, right? You cared enough to pass over to your friends and et cetera. Some people didn't care. Or a bit like epidemics. So this is the logic then when you bring it to gpu. And I get questions about this because of course was being seen with the paper and online on GitHub is just a small portion of. Of our work. What do you see in the code and in the very GPU friendly implementation that we have? There are ultimately we have to talk about matrices and it looks very much like a transformer. But there are important differences that happen. Attention is linear. So when you look at sizes of let's say our matrices, the traditional transformer has this large square, right? And there will be squares and they're dense. Whereas for us it looks. The only dimension that we care about is the N. That's the number of neurons effectively. And the objects we work with look like very kind of narrow but very long snakes. Okay. We have those huge squares in transformers that grow quadratically and we have this very long snail which, which effectively we only care about the length, which is N. And this is the number of our neurons. And then to make it sparse because of course we had to, as I said, we had to apply some makeup to make it work on GPU that really likes dense matrices. We actually multiply it by A positive sparse vector of activations. And this once you actually multiply one for the other, you get back your graph. So this is for the listeners who might be asking, okay, where is the sparse hidden in the implementation that is GPU friendly where actually mathematically it is there. So you get it back and hence you can actually do a lot of things like spot the synopsis responsible for specific concepts, et cetera. I mean sparsity is very much there. Even though if you just look at the code like this, you might not spot it because it's hidden in one specific multiplication. And this, the fact that we have positive, sparse, positive vectors of activations, this is a very big, very big difference that I mean our CSO loves to explain as well. I mean I think once he used power of macaron to explain the kind of the space of those vectors and the.
[32:20]
B
You talked about activism. You know, you need a certain amount of signal to activate a neuron that sounds like spiking neurons is.
[32:31]
A
It's an edit. This is not, I mean intuitively, I mean not in like all the implementation details, but it's intuitively it's not too far. We're getting to. So what actually responds to our paper was especially good from the neuroscience community because indeed we kind of start to show the way of bridging very non organic AI with the models and the thinking of how neuroscientists see brain activity. Of course we're not getting to the levels of chemical interactions like reactions that happen in the brain, but we are offering plausible brain like explanation for how reasoning may appear.
[33:24]
B
And then the sparse. The activation of when the network is working. I understand that it's very localized. You're not using the entire network. It's whatever the edges connect. That sounds a little bit like mixture of experts. Am I wrong there?
[33:54]
A
This isn't just at a very at infinitely small, I mean not infinitely, but really tiny granular scale. So if you were. Because mixture of experts, you know, is more kind of defined from top down here imagine that your expert can be even just one, potentially could be just one edge. And you don't decide expert groups that are being decided on their own kind of organically while training. This is important. This is also maybe coming really deeply from the complex systems intuitions. We really don't impose any structure on the system. The system does what the system wants to do. And that's the point. That's the magic of it. This is also why we believe this kind of ultimately the space that's created can be very interesting. For reasoning. Because at the same time the concepts appear, the more important ones are strengthened and gives you a topology potentially with shortcuts to explore. And if you believe that reasoning is a way, some sort of search in the space, then then this gives you an interesting topology to explore and it's aligned with the method of learning itself.
[35:19]
B
And then as, as the system learns, edges become stronger or connections become stronger or new connections are formed. So the graph becomes denser not in the number of nodes, but in the number of connections.
[35:47]
A
Yes, go ahead. Plus some may fade over time. Yeah, so yeah, if you don't use them, I mean you have only positive activations, but you may have fading as well. So in a way, empirically what we find is that we get like almost a scale free type of, type of distribution of degrees of nodes which gives. This is a type of structure that we find in real world organic networks quite a lot. Because this is one that's kind of known to be pretty resilient. Resilient and with good communicability. And also it behaves similarly wherever you zoom in in the network. So it kind of, it has like similar properties irrespective of its size.
[36:41]
B
And how, how you said you, you built like a, a billion parameter model using this architecture, is that right? How, how is it so not only how is its performance, but I understand memory and in transformers the intention mechanism, but how the memory here is the strength of the connections. Right, so, so yeah, so but how, how, how does it perform? For example in is it's not autoregressive, right? Or is it how, how does it perform on language? I mean you, you built a language model with this, right?
[37:40]
A
Yes. So this is, we train it on predominantly on language. So I mean we're not talking about division models, for example. Right. Like there is, there's an entire line of, of of research and, and right now attracting a lot of the attention which is actually linked to vision and, and world models per se. We actually believe that, we believe that there is maybe, maybe a way to bridge the two. But yes, we're looking at the, we're looking at the language model and we compare it on traditional language tasks. So these are the benchmarks that we, or the tests that we are showing in the paper. We're looking at translation, we're looking at how it behaves at traditional language tasks a bit the way in which transformers were developed. So the question of scaling it further in fact and the features that we see more likely coming out of it, out of this architecture scaled well Scaled or developed into a model, into a product is let's say the power of your output with way less data. So with a way smaller model and way less data, we should be able to get to the two same results. I mean compared to GPT2. So I mean comparing apples to apples, architectures to architectures on the same data sets. Exactly. Same data. I mean we're actually learning sometimes even like faster than transformers and scaling laws are preserved. Knowing however, that we don't need to grow this size of neurons so much and we can be adding, I mean we have state right, which is huge and which is huge. While we can actually use it efficiently. So it's both packed efficiently because it's packed in the graph between the connections between finally not so many neurons. And then because of the local dynamics, we only access a bit of it at every, let's say, step.
[39:59]
B
And again, some more dumb questions. So I can understand this. I mean in, in the brain, these local networks that are activated represent concepts, perceptions, all of that memories and in, in your system. But, but it's not. Let me think. How am I going to say this? It's it. The, the knowledge that's stored there is. Depends on, on the connections between these neurons in the knowledge kind of emerges from those connections. Right. It's not obviously not stored explicitly. And does, does that. Is there some plasticity in that? Like if you cut a bunch of connections, you still have the knowledge. Maybe it's got to grow stronger connections again. But yeah, what can you talk about?
[41:24]
A
So this is, this is, this is, this is a beautiful question. I would like to. Your questions are actually very deep, so definitely not what would qualify, you know, as dumb question. This is, this is, this is dramatically important. I'll take it in two ways. So the very, the very concept that we have here is synaptic plasticity. This means that not only a message is passed through the network or through a connection, but then this connection becomes stronger because it was triggered. And this goes on perpetually. Somehow as the system lives, we have this effect and then there's this update to the slower weights that just change at slower frequency. But then for the plasticity of the network, this is, this is a brilliant topic actually for complexity science and network resilience in general. And one key intuition that actually this one came from me, I mean most of this research actually came from my colleagues and our team, but this one came from me is that function shapes the network. And you actually may have. And this is something that that neuroscientist at that coffee shop told me you may, you may sometimes swap the nerves and the function of, of them though, like let's say the example he was giving me was the auditory with the visual nerve. But then the mouse will turn out just great because the, the stimulus like the, the stimulus that was kind of given shape the network because the network's job is to transport it in a way. So it's getting, getting these local like local functions of what the network, what the edges, what the nodes are doing that then help you to potentially even compensate. So I don't think we've run like deep enough studies of what's, what's happening even, even in our model right now in terms of its resilience to deletion right. Of nodes. But I've done a number of studies of how such systems work and how scale free networks behave when you delete. Two intuitions is that function shapes network. There's a good chance that some functions will be overtaken by the locality and this will also happen according to the rules of local connections like let's say common neighbors between the nodes. And this is like some redistribution maybe of function is likely to happen. And yeah, so this is a very deep topic, very, very interesting. And then as for concepts appearing, I mean, yeah, there's a number of things that happen with the fact that we're just working with, they mentioned n. But this is maybe jumping across topics so.
[44:34]
B
Well, actually I was gonna jump but. Well, I, I will. So you're in, in complexity studies and you're studying emergent properties and the obvious thought is that in consciousness studies, you know, a lot of people think consciousness is an emergent property of the, of the complexity of neural interactions. Do you have any thoughts on that? I mean I know that's, that's in left field, but are, are there or, or rather in a less sort of unanswerable way of. Is there, do you think there will be creativity in this kind of a network in that the network will be able to think beyond the fixed parameters of an LLM and come up with new ideas. And this is something I was talking to David ha about in, with evolutionary systems in general. They're not, they're, they're looking across a landscape not simply, you know, following gradient descent to a. Yeah. To a, to a local optimum or something. So, so they can jump over across to find, you know, ideas that, that didn't, wouldn't necessarily emerge from a fixed LLM. So yeah,
[46:29]
A
yeah, absolutely. So I, I don't think I feel nearly qualified enough to have a position of how consciousness appear, appears in the brain or how it can, you know, appear ultimately in a lens. I mean, it is somehow completely by accident. I started reading about Heidegger over Christmas and, and I think, I mean I don't even know the advance, the advances in philosophy lately, but the definition of being and thinking may be somehow impacted by, by, by what's, what's happening with the reasoning models. So are just my, you know, kind of almost cultural thoughts when it comes to generalization? Of course this is the main goal. This is why we're doing all of this because the, the real innovation wouldn't be just recomposing things that exist, but pretty much seeing the loophole. And, and, and, and that this is then when, when you know there is something interesting that should be added. How do you know that there is something, something interesting that should be added that you need to actually somehow modify what's your, modify your topology. Right. And yes, so we are, there are two ways to maybe generalize the general. The most imminent generalization that we are after is generalization over time. So this is that this is the kind of capacity to maintain coherent reasoning over, over time without falling into hallucination and just doing something completely silly. That's the intuition. And then generalization in terms of having really, I mean internally we call it the model, which is a real innovator, right? One that can come up with eureka moments. And these eureka moments, they don't result from very formal thinking. I mean, I believe that very few mathematicians really kind of start their proof and work. It's formally step by step reaching, reaching the, the conclusion. I mean usually they have in different, different bits of information of conviction filling in dilemmas on the way. Sometimes are even like strategic proofs. Like how so I know mathematicians who actually design ways to, strategic ways to, to get to their proofs. So yes, and I think that this will go through having some sort of like an internal representation that allows for it. I mean our hope is that having this plasticity and the structure, which is not 2D is not 3D is actually somewhat efficient and emergent will, will make it easier. And then there is of course a lot of work to be done for this.
[49:26]
B
So what are next steps? I mean this is not productized yet. It's still in the research phase. Is that right? And where are you guys headed?
[49:38]
A
Correct. Knowing that we're actually right now working on productization because I mean naturally it unlocks a lot of, a lot of interesting properties. I mean some of the first ones Are infinite context windows enhanced reasoning. In some use cases it can be jump from zero to one. So we actually partner with Nvidia and AWS something that we announced at Rainvent in December. And the moment the first model will be ready it will be immediately available to AWS customers. And yeah, so this is. Let's say this is being productized as we speak. Knowing that we're productizing along
[50:22]
B
what kind of use cases do you think the first iteration would be applied to?
[50:28]
A
So the very good use cases for this are the ones that are linked to small. So we're looking highly valuable small data where we actually want to. We were doing for example research about about or you make new designs for. I mean we actually saw a use case like this for. For in the nuclear space. Right. You don't have a lot of. A lot of documents but those that exist are actually very, very highly valuable. And you. You would like to get some ideas for. For the current engineers. Right. And ideally you won't learn it very very effectively from because there's just not enough data. And then we're looking at cases where you have continual learning and data that kind of keeps on changing where you want to deliver value. And then also reasoning which is slightly more kind of complex because you. Which is highly personalized. So the use case that we look AT is e.g. healthcare claims resolution. This is because of the fact that okay, it has to be very personalized. It is complex reasoning because you need to within the context of what's linked to the reprocess of the claim resolution, all the information that we that was given, you kind of need to. You need to resolve it. And also you need to be able to explain why and how. And because we kind of see, we have. We have some advantages in terms of interpretability because we kind of see the synopsis that light up. And then also in reasoning, I mean you can. You can have layers of let's say explaining of what happened. We have some benefit like big benefits for the regulated industries. So but generally speaking think about small data with time changing elements.
[52:38]
B
Yeah.
[52:40]
A
And potentially next best action suggestion. Yeah which is you know, or like sort of reasoning.
[52:48]
B
And in terms of the traditional, traditional, I mean they've only been around for a few years but LLM weaknesses hallucination. Does this eliminate the the problem of hallucination? Because hallucination comes happens during inference when I mean from my simplistic understanding when the probability distribution of the next token doesn't include the correct token and gives something that's in the distribution. But then that error is propagated as you go along. First of all, is that right? But second of all, in that you're not predicting the next token from a probability distribution. I don't know if reading's the right word, but you're, you're looking at an existing activation of existing sub network within the network or within the graph. Yeah. How does that affect hallucination?
[54:23]
A
Yeah. So just actually one of the big motivations for this work was to limit hallucinations. Right. And especially hallucination over time. So important to note because I think we didn't stress it at the beginning. All the dynamics that we talked about happen at inference. It all happens at inference. So indeed this is the hope that because you keep the memory, you can keep consistency and, and as you go on, you don't fall over very easily. I mean you have, that's like a structure to hold on to. I mean it was too early for me to communicate results.
[55:14]
B
Yeah. And you said that it follows the same scaling law laws or similar scaling laws to Transformers, which was, was a huge advantage for Transformers. And, and you know, what they did is just build bigger and bigger and bigger networks and it got better and better and better, at least to a certain point. Do you think the same will happen with bdh with baby Dragon? What is. You know,
[55:52]
A
actually the paper is Dragon Hatchling. So it's Dragon Hatchling and the acronym is bdh. And I think people are very, very just tempted to put a B. Explain to me. So it's Dragon Hatchling. We are not playing the scaling game in the sense that we don't need to grow the end so much. And the ultimate goal. But this is, you know, this is, this is the goal of almost everybody. For us the path goes through memory, but the ultimate goal is to get to generalization and generalization in reasoning. So yeah, we are focused on reasoning models and seeing actually not the scale per se because the power won't be coming from the scale. So yeah, point is we don't need to. We're not playing the game of having our n. You know, being humongous because I mean that that's precisely not. Not the reason why we're doing it. I mean this is a way more efficient way of doing of, of of learning and then, and then evolving and storing memory hardware sense. It also sits on the, and that this is a big advantage. And if you look at those like if it sits on the chip, you do less lookups and you know, you may divide your compute for reasoning 10 times for the same output token. Right. Plus you can probably do reasoning on cases which are more complex because you have consistency over like consistency you have capacity to work for longer without falling off and into, into hallucinations. So it may be a more sustainable, I mean we believe it's a more sustainable way forward also in terms of, in terms of compute and how it distributes. It's a distributed system. So there is a part of theoretical computer science like a very big community actually that, that deals with distributed systems and it is itself for those listeners, if they're here this effectively in like a distributed system
[58:08]
B
the is is there. The reason you're not looking at scaling is because the number of connections between the nodes or the neurons in this graph can, can you can pack enough in that you don't need to scale. Scale the number of parameters or, or I mean it could be that if you scale you'll, you'd be able to contain that much more information in.
[58:45]
A
It's possible. It's possible. So I'm not saying. But we don't need, we don't need to reach this. Our success is not blocked by the scale, by the, by the size of N. I would say it this way. So for us focusing on the larger end it's probably fun but I mean our goal is in fact to get to this reasoning in places where it wasn't possible before. And so I'd say we have a slightly different objective and for this as of now we didn't see the scale as being the main bottleneck. And yes, this structure allows us to pack a lot in the connections not in the, not in the. In the end but in the, in the synapses and then keep it apart enough because we activate only parts of it and such that when we touch one part we don't necessarily touch the other part of the, of the model. So I mean hence, hence some of the nice properties for against catastrophic forgetting and like the plasticity is very, very important in this entire world.
[59:49]
B
Work, Work, you know, with transformer architectures, once they were validated and they exist, then there's this whole second layer of activity in combining models or combining different kinds of models into larger systems. Do you think this would work with traditional LLMs? You know, maybe, I don't know, maybe the transformer architecture handles one thing and the post transformer architecture handles something else.
[60:31]
A
Oh in that sense. So I mean in terms of, you know, of maybe putting models together, I mean at the engineering level I believe there will be different use cases for which different models will Be used and for your traditional knowledge type of use cases where we have language models, I mean they are doing phenomenally well. And I believe that a lot of the right now chatbot types of use cases, I mean they will stay with the LLMs and they will be doing great. So I believe that the models based on BDH and when she had like other architectures, I mean they will be used in use cases that for example require consistent reasoning over a longer time. Right. So within data. So we're definitely looking at enterprise use cases naturally or deep innovative research where you potentially with also changing inputs right over time because you may want to kind of maybe iterate along with the environment as things progress. So I don't know, in the engineering sense, I imagine you can, you know, you can use different models for different bits and it's kind of question of just, I mean, building the system in which you can do it on a more fundamental way. Can you glue BDH with a transformer and have kind of one model that somehow works together? Because I mean this is, this is what some people who modify attention do. This is not the case. This is not the case. So this is actually either transformer or bdh because it's so different, so fundamentally different. It's not let's say a plug into a transformer or a plugin. Then there is however another just, I mean anchoring maybe on the merging kind of tape of gluing angle because we have just this dimension N like just the neurons. Right. It's a graph and it's a distributed system. You, you can totally see that you're taking those brains and you kind of train them separately and you do them together and then you maybe have a couple of runs of training together such that they form connections between them and then becomes even stronger. And in this sense you can get, you know, not a separate mathematician and a separate separately trained computer scientist, but all of a sudden you actually have a mathematician and computer scientist who combines the intuitions of the two disciplines in one. Or you know, this can work for finance and law.
[63:19]
B
You're talking about training separate graphs and then merging them, is that right?
[63:26]
A
Yeah. Then gluing them and we glue them along this one dimension N. We actually show a small experiment in the paper for this where we actually didn't do the kind of the round of training when they were glued together, but we actually on language we do. We're going to glue the two, the two models trained into two different languages and then all of a sudden they start to speak some sort of like Creole. I mean like a language that makes sense but mixes the words of the two languages. Point is, it's very simple with this architecture because effectively you look at the graphs and you just put them together so it feels a bit like composable programs.
[64:09]
B
Yeah. And it sounds again, Sakana AI came up with this, I think they call it model merge where they can put two models together and that's part of their evolutionary strategy that they, they'll have a bunch of models, models, you know, generate output. They, they pick the best models and merge them into. Presumably you could do that as well with, with BD8.
[64:46]
A
I mean, I, I, presumably you could, I mean I haven't explored it in, in full honesty, but yes, you could technically because you can, you can merge them and if, if you want to train evolution, like do some sort of kind of evolutionary pruning of what's your best model, then then probably you could do it. I, I don't know if it has a purpose in our case, but technically,
[65:14]
B
yes, the merging is, that's all fascinating Susanna, really. And yeah, because transformers it seems have kind of hit a plateau or are hitting a plateau. So it's fascinating to see new architectures emerging or new strategies emerging. Yeah. Okay, well, let's leave it there. Is there anything I didn't ask that you'd like listeners to know maybe where they can go and find, find BDH or.
[65:54]
A
Yeah, yeah. Thank you, thank you so much for this conversation. It was great and I mean these were pretty deep questions. I really appreciate it. So I mean you can, you can reach out to us. I mean for me LinkedIn is probably the easiest also. Researchpathway.com this is for all the kind of research types of questions. I mean we're very, very happy to answer there. Please note that there is. The paper is available on GitHub to digest it if you. It's pretty long. I mean Craig, I think you've got it. It's very deep. It's very dense and very long because it has to call it covers a lot of intuitions, actually a lot of proofs as well because we're showing the link between Transformer and the brain kind of that goes through expressivity. So all of this goes again pretty deep. Computer science, theoretical computer science, that is not always the thing that being the most, let's say studied in the AI community. So I can strongly encourage something that we've seen where working a lot very well with, with many people actually by now LLMs are doing a phenomenal job of this paper especially, especially the reasoning that, the most advanced reasoning ones, once we ask them to properly read the paper, you know, with the proofs and then, and then break it down, they really do an amazing job. So I would strongly, strongly encourage you to use AI to kind of read about AI.
[67:22]
B
Yeah, I read it on Archive. It's the same paper, right?
[67:29]
A
It's the same paper. And the paper, the paper is publicly available, of course. And then, I mean, there is also, there are also other podcasts, you know, that are way more technical with other members of our team. So I would also strongly invite you to listen to this if you have questions. This one was great. I mean, thank you. Thank you so much.