Summary

Next in AI: Your Daily News Podcast

Episode Title: Brain Surgery for LLMs: Scaling Transformers with Embedding Modules
Date: January 21, 2026
Host: Next in AI

Episode Overview

This episode explores a major breakthrough in large language model (LLM) architecture introduced in the paper "Scaling Transformers with Embedding Modules (STEM)." The discussion centers on the limitations of current scaling approaches in AI, specifically the costs and complexities of continuously increasing model size, and introduces STEM as a radical redesign: effectively removing some of the most computationally expensive model components and substituting them with a simpler, more editable, and efficient memory mechanism. The hosts dive into the technical details, the practical implications, and the profound potential — and risks — of this new architecture.

Key Discussion Points and Insights

The Scaling Wall in AI

Logistical Nightmare of Larger Models
- The past playbook: "Make it bigger, add more parameters, throw more data at it, buy more GPUs."
  [00:17–00:35]
- Diminishing returns: Models become slower, costlier, and harder to train.
- Quote:
  
  “It’s like trying to build a skyscraper on a foundation meant for a two story house. Eventually the physics just starts working against you.”
  (Speaker B, 00:42)

The Problem with Current Architectures (FFN and MOE)

Feed Forward Networks (FFN):
- Where models store "facts."
- Every token activates all neurons, making retrieval inefficient.
  [02:14–02:59]
- Quote:
  
  “If I ask for the capital of Spain, I don’t need the model to activate the neurons related to quantum physics or baking recipes.”
  (Speaker A, 02:59)
Mixture of Experts (MOE):
- Only activates relevant "experts" for a specific task, which theoretically increases efficiency.
- Analogy: A corporate committee with a router (middle manager) handling requests.
- Issues:
  - Training instability (“loss spikes”), routing complexity, latency from data shuffling across hardware, and management overhead that “eats up the savings.”
    [03:44–05:05]
- Quote:
  
  “It’s like hiring a middle manager who spends all day organizing meetings instead of letting people work.”
  (Speaker A, 04:55)
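
To make the routing overhead concrete, here is a minimal sketch of a top-k router for a single token. It is an illustration only: the function names, the tiny sizes, and the top-2 scheme are assumptions and are not taken from the episode or the paper. The point is that the router has to score every expert for every token before any expert does useful work.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(token_vec, router_weights, k=2):
    """Score every expert for one token and keep the k most probable.

    router_weights holds one row per expert. Each score is a dot product,
    so the router alone costs num_experts * d_model multiply-adds per token.
    """
    scores = [sum(w * x for w, x in zip(row, token_vec)) for row in router_weights]
    probs = softmax(scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    norm = sum(probs[i] for i in chosen)
    # Renormalise so the weights of the chosen experts sum to 1.
    return [(i, probs[i] / norm) for i in chosen]

rng = random.Random(0)
d_model, num_experts = 8, 4  # hypothetical toy sizes
router_weights = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(num_experts)]
token_vec = [rng.gauss(0, 1) for _ in range(d_model)]

assignments = route_top_k(token_vec, router_weights, k=2)
router_madds = num_experts * d_model  # routing cost paid for every token
```

In a real deployment the chosen experts may sit on different devices, which is where the data-shuffling latency mentioned above comes from.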

The Radical Approach: STEM (Scaling Transformers with Embedding Modules)

The Core Idea:
- Eliminate expensive lookup calculations in FFN by assigning each fact/token a fixed vector in a lookup table.
- Analogy:
  
  “The STEM way is I just have home saved in my project. I click it and I’m there. No math required.”
  (Speaker A, 06:38)
Technological Details:
- All parameters for a token are pre-indexed (“token indexed”), enabling direct access — static, no calculation/routing overhead.
  [06:13–07:03]
- Key Value Theory:
  - STEM capitalizes on the latent key-value structure of FFNs.
  - Arrows (vectors) for each fact/token become nearly orthogonal (cosine similarity close to zero), meaning they point in very different directions, which greatly reduces overlap and interference between concepts.
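
The idea described above can be sketched as follows, assuming a gated FFN with the three matrices named in the episode (up, gate, down). The SiLU activation, the helper names, the toy sizes, and the token ID are assumptions for illustration; this is not the paper's implementation. The only difference between the two functions is that the lookup variant reads a row from a token-indexed table instead of multiplying by an up-projection matrix.

```python
import math
import random

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def silu(x):
    """SiLU activation, x * sigmoid(x). Assumed here, not stated in the episode."""
    return x / (1.0 + math.exp(-x))

def dense_ffn(x, w_up, w_gate, w_down):
    """Standard gated FFN: three matrix multiplications per token."""
    up = matvec(w_up, x)
    gate = [silu(g) for g in matvec(w_gate, x)]
    hidden = [u * g for u, g in zip(up, gate)]
    return matvec(w_down, hidden)

def lookup_ffn(x, token_id, table, w_gate, w_down):
    """Lookup variant: the up projection becomes a row read keyed on token ID."""
    up = table[token_id]  # indexing only, no multiplication
    gate = [silu(g) for g in matvec(w_gate, x)]
    hidden = [u * g for u, g in zip(up, gate)]
    return matvec(w_down, hidden)

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_ff, vocab_size = 4, 16, 10  # hypothetical toy sizes
w_up = rand_matrix(d_ff, d_model, rng)
w_gate = rand_matrix(d_ff, d_model, rng)
w_down = rand_matrix(d_model, d_ff, rng)
table = rand_matrix(vocab_size, d_ff, rng)  # one d_ff-sized row per token ID

x = [rng.gauss(0, 1) for _ in range(d_model)]
token_id = 7  # hypothetical ID
y_dense = dense_ffn(x, w_up, w_gate, w_down)
y_lookup = lookup_ffn(x, token_id, table, w_gate, w_down)

# The up projection is one of three equally sized matrices in this layout.
up_share = (d_ff * d_model) / (3 * d_ff * d_model)
```

Under this layout the up projection accounts for one third of the FFN matrix weights, which is in line with the "about 30%" figure quoted in the episode. The table itself adds `vocab_size * d_ff` stored values per layer, but only one row is read per token.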
Efficiency and Scaling Advantages:
- Parameter Reduction:
  - Cutting the “up projection” matrix eliminates about 30% of the parameters in affected layers.
    [08:38–08:49]
  - Quote:
    
    “30% is not a rounding error. That’s significant.”
    (Speaker A, 08:49)
- Memory Management:
  - Vectors/tables pushed to cheap, abundant CPU RAM, not expensive, limited GPU memory.
  - Prefetching: CPU is cued in advance to deliver the data just in time, “hiding the latency.”
    [09:14–09:58]
  - Quote:
    
    “By the time the GPU actually needs to do the math, the data has already arrived from the CPU. They call this hiding the latency.”
    (Speaker B, 09:50)
- Performance Metrics:
  - On 1B-parameter models: 20-25% reduction in compute steps (FLOPs); 3-4% improvement in accuracy.
  - “Accuracy up, compute down. That’s the holy grail.”
    (Speaker A, 10:19–10:22)
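
The prefetching idea can be sketched with a background worker that fetches the next token's row while the current one is being processed. This is a toy model under stated assumptions: the table contents, `fetch_row`, and `compute` are hypothetical stand-ins, and Python threads only illustrate the ordering. The episode does not describe how the paper implements the transfer.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a large table kept in CPU RAM: token ID -> vector.
cpu_table = {tid: [float(tid)] * 4 for tid in range(100)}

def fetch_row(token_id):
    """Simulates copying one table row from CPU memory to the accelerator."""
    return list(cpu_table[token_id])

def compute(row):
    """Stand-in for the accelerator work that consumes a fetched row."""
    return sum(row)

def run_with_prefetch(token_ids):
    """Overlap the fetch for token i+1 with the compute for token i."""
    outputs = []
    if not token_ids:
        return outputs
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_row, token_ids[0])
        for next_id in token_ids[1:]:
            row = pending.result()                     # wait for the current row
            pending = pool.submit(fetch_row, next_id)  # start the next fetch
            outputs.append(compute(row))               # compute overlaps that fetch
        outputs.append(compute(pending.result()))
    return outputs

prompt_ids = [3, 17, 42]  # token IDs are known as soon as the prompt is tokenized
results = run_with_prefetch(prompt_ids)
```

The scheme works because the lookup key is the token ID, which is known before the layer runs, so the fetch never has to wait on a routing decision.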

Editing Model “Facts” — Brain Surgery for LLMs

Interpretability:
- Each fact is now at a physical address in the model.
  
  “In STEM, Spain is in row four, slot B. It has an address.”
  (Speaker A and B, 11:03–11:04)
Direct Fact Swapping:
- Example: With a prompt of "Country: Spain; Capital: ___", researchers swap Spain’s vector with Germany’s in the lookup table, and the model outputs “Berlin.”
  [11:12–11:38]
- For multi-token concepts (e.g., “Czech Republic”), averaging the vectors serves as a “semantic smoothie.”
  
  “You put that blended vector into the USA slot, and the model started talking about Prague.”
  (Speaker B, 12:12–12:15)
Implications for Model Correction:
- Instantly patch hallucinated or erroneous facts by overwriting vectors without expensive retraining.
  
  “We just find the flat earth vector and overwrite it with the round earth vector. It’s a patch. A software update for facts.”
  (Speaker A and B, 12:33–12:40)
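
The two edits described above, a single-token overwrite and a multi-token average, can be sketched on a toy table. The token IDs, vector values, and helper names are made up for illustration; a real table would hold learned vectors, one table per layer.

```python
# Toy table: token ID -> vector. IDs and values are hypothetical.
vocab = {"Spain": 0, "Germany": 1, "USA": 2, "Czech": 3, "Republic": 4}
table = {
    0: [1.0, 0.0, 0.0],
    1: [0.0, 1.0, 0.0],
    2: [0.0, 0.0, 1.0],
    3: [2.0, 4.0, 0.0],
    4: [0.0, 2.0, 6.0],
}

def average_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def patched_table(source, target_id, replacement):
    """Return a copy of the table with one row overwritten."""
    edited = {tid: list(vec) for tid, vec in source.items()}
    edited[target_id] = list(replacement)
    return edited

# Single-token edit: the Spain slot now returns Germany's vector.
swapped = patched_table(table, vocab["Spain"], table[vocab["Germany"]])

# Multi-token edit: average "Czech" and "Republic", then write into the USA slot.
blend = average_vectors([table[vocab["Czech"]], table[vocab["Republic"]]])
blended = patched_table(table, vocab["USA"], blend)
```

Returning an edited copy keeps the original table intact, which makes it easy to compare outputs before and after a patch or to roll the patch back.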

Real-World Performance and Long Contexts

Benchmarks:
- Outperforms standard “dense” models in reasoning-heavy tests (ARC Challenge, OpenBook QA, GSM8K).
  [12:47–13:12]
- Superior at “needle in a haystack” retrieval as documents grow longer — instead of performance drops, STEM’s advantage widens.
  
  “When a conversation gets long… STEM naturally scales its capacity. Every unique word you add activates its own distinct parameters. So the more vocabulary you use, the more brain power the model effectively brings online.”
  (Speaker B, 13:36–14:00)
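
A rough way to read the quote above: if each distinct token reads its own table row, the number of table parameters touched grows with the number of unique tokens in the context. The sketch below just counts that. The function name and sizes are hypothetical, and the count ignores everything else in the model.

```python
def active_table_params(token_ids, d_ff, num_layers):
    """Count lookup-table parameters touched by a sequence.

    Each distinct token ID reads one d_ff-sized row in every layer that has
    a table, so repeated tokens add nothing new.
    """
    return len(set(token_ids)) * d_ff * num_layers

short_context = [5, 9, 5, 2]
long_context = short_context + [11, 14, 9, 23, 31, 2]

d_ff, num_layers = 16, 2  # hypothetical sizes
short_params = active_table_params(short_context, d_ff, num_layers)
long_params = active_table_params(long_context, d_ff, num_layers)
```

Here the short context touches 3 distinct rows per layer and the long one touches 7, so the active table parameters rise from 96 to 224 while the per-token cost stays one row read per layer.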

Notable Quotes and Memorable Moments

On the philosophy of the method:

“How do you make a model smarter? By deleting its math.”
(Speaker A, 01:59)
On interpretability and risk:

“STEM turns the lights on, it makes the box transparent. But once you can see the wiring, you can cut the wiring.”
(Speaker A, 14:27–14:38)

“When you make the brain easier to heal, you make it easier to hack.”
(Speaker A, 15:19)

“Interpretable AI is powerful, but it requires a whole new layer of security we haven’t even built yet.”
(Speaker B, 15:22)
On potential for both safety and vulnerability:

“On one hand, we can surgically remove bias. ... But the flip side, imagine a bad actor ... They don’t need to poison the training data ... They just need to open the lookup table and swap the vector for safe with toxic.”
(Speaker B and A, 14:38–15:08)

Timestamps of Key Segments

00:00–01:16 — The “scaling wall” and why brute force is dying
02:14–03:44 — FFN and MOE described; how models store and retrieve facts
05:09–07:03 — Introduction to the STEM architecture and the radical lookup table idea
07:17–08:27 — Key-value memory theory and orthogonality in embeddings
08:38–10:29 — Efficiency, memory management, and real-world hardware advantages
10:29–12:40 — Editing facts: addressable, swappable memory; implications for error patching
12:47–14:02 — Performance benchmarks and handling long context windows
14:20–15:22 — Double-edged sword: interpretability and attack vectors
15:41 — Reference to visual diagram in show notes (Spain/Germany swap)

Conclusion

This episode highlights STEM’s transformative approach to LLM architecture: simplifying and making model memory editable, dramatically improving efficiency and interpretability, and raising both the potential for safer, more transparent systems and the challenge of new security vulnerabilities. The podcast wraps with a call to view the Spain/Germany vector swap diagram and reflects on the new era of powerful, but hackable, interpretable AI.

Transcript

[00:00]
A
You know, there's this feeling right now in AI, if you talk to the engineers off the record in the break room when the bosses aren't listening, that we might be hitting a wall. Or if not a wall, a very, very steep hill.
[00:15]
B
The famous scaling wall. The elephant in the server room.
[00:17]
A
Exactly. The scaling wall. For the last few years, the playbook has been incredibly simple, almost boringly simple. You want a smarter model? Make it bigger, add more parameters, throw more data at it, buy more GPUs. And for a long time, that brute force approach worked wonders.
[00:36]
B
It did.
[00:36]
A
But we are reaching a point where bigger is becoming, well, a logistical nightmare.
[00:42]
B
It's the law of diminishing returns hitting us hard. You can make models bigger, sure, but they become incredibly slow, they cost a literal fortune to run, and honestly, they become unwieldy to train. It's like trying to build a skyscraper on a foundation meant for a two-story house. Eventually the physics just starts working against you.
[01:00]
A
And that's why the paper we are diving into today is so exciting. It's titled Scaling Transformers with Embedding Modules. And it basically asks a heretical question: what if we stop making the math harder? What if we actually make the brain of the AI simpler?
[01:16]
B
It really is a less is more moment. This paper isn't just a tweak or a minor optimization. It's proposing that we rip out one of the most computationally expensive parts of a large language model.
[01:27]
A
The part that stores the facts.
[01:29]
B
The way it stores facts, and replace it with something that looks suspiciously like a simple lookup table.
[01:34]
A
It feels like brain surgery. They're showing that by changing how the model accesses memory, you can literally swap out its thoughts in real time. We're talking about making an AI believe the capital of Spain is Berlin without changing a single word of the input text.
[01:50]
B
It is a wild paper. It touches on efficiency, interpretability and the fundamental architecture of how these things think.
[01:57]
A
Think.
[01:58]
B
It completely challenges the status quo.
[02:00]
A
So let's get into it. The mission today is to decode STEM. How do you make a model smarter? By deleting its math. But before we get to the solution, we have to understand the problem. We mentioned that models store facts in something called the Feed Forward Network, or FFN.
[02:15]
B
Right. So if you visualize a transformer model, which is what GPT, Claude and Llama all are, you have the attention layers.
[02:22]
A
That's the famous part.
[02:23]
B
That's the famous part. Attention understands context. It figures out that if I say I deposited money at the bank, bank means a financial institution, not a river edge. But then you have the FFN layers. This is the knowledge bank.
[02:38]
A
So attention is the logic. FFN is the encyclopedia.
[02:42]
B
Roughly, yes. The FFN is where the model remembers that Paris is in France, that the speed of light is constant, or, you know, who won the 1998 World Cup. Now, in a traditional dense model, every time the AI looks at a word, it has to scan that entire encyclopedia. It activates every single neuron in that layer.
[02:59]
A
Which sounds exhausting and incredibly inefficient. If I ask for the capital of Spain, I don't need the model to activate the neurons related to quantum physics or baking recipes.
[03:08]
B
Exactly. It's a waste of energy. So the industry pivoted to Mixture of Experts, or MOE.
[03:14]
A
The logic here is, why activate the whole brain? If the topic is biology, let's just activate the biology neurons.
[03:24]
B
I've always liked the corporate committee analogy for this. Instead of one genius doing everything, you have a bunch of specialists. You have a geography expert, a coding expert, a history expert.
[03:31]
A
And you have a router. The router is like the traffic cop or the receptionist at the front desk. A word comes in, the router looks at it and says, okay, this is a question about photosynthesis. Send it to the biology department.
[03:44]
B
It sounds perfect on paper. You get the capacity of a massive brain, but for any specific task, you only use the compute of a small brain. So where does it fall apart? Why are we looking for a new solution?
[03:55]
A
It falls apart because managing that committee is a nightmare. First, training these MOE models is notoriously unstable. They suffer from loss spikes.
[04:03]
B
Loss spikes?
[04:04]
A
Imagine you're training the model. Everything is going well. The learning curve is smooth, and suddenly, boom. The model gets confused. The error rate shoots up, and you have to restart or roll back.
[04:14]
B
It's like the router suddenly forgot who works in which department and started sending the lunch orders to the accounting team.
[04:20]
A
That sounds expensive.
[04:21]
B
It is. But the bigger issue is the traffic jam. In a massive data center, these experts might live on different GPUs or even physically different servers.
[04:31]
A
Right.
[04:32]
B
When the router says, send this to expert B, that data physically has to travel across cables to get there.
[04:38]
A
Latency.
[04:39]
B
Massive latency. You're spending time moving data around instead of calculating. And the router itself has to do math to make that decision.
[04:48]
A
It's not free.
[04:48]
B
It is not free. It has to calculate probabilities for every single token to decide where it goes. It's a bottleneck.
[04:56]
A
So we built a system to be efficient, but the management overhead, the router, is eating up the savings. It's like hiring a middle manager who spends all day organizing meetings instead of letting people work.
[05:06]
B
That is a very, very accurate description of the MOE problem.
[05:10]
A
Okay, so MOE is the current state of the art, but it's messy. Enter STEM: Scaling Transformers with Embedding Modules. What are they doing differently to fix this traffic jam?
[05:20]
B
They are attacking the FFN layer. Specifically, they are looking at the math used to access information. In a standard model, accessing a fact usually involves three big matrix multiplications: up, gate, and down.
[05:34]
A
Up, gate, down. Let's focus on the up part.
[05:37]
B
The up projection is essentially the model trying to figure out where to look for information. It takes the word, say Spain, and runs a complex calculation to map it into a higher dimensional space to find the relevant neurons.
[05:50]
A
So it's doing a calculation to find an address?
[05:52]
B
Exactly. It's calculating the address on the fly, every single time. STEM says stop doing that. They delete the up projection entirely.
[05:59]
A
Wait, if you delete the search mechanism, how does the model find anything? That feels like lobotomizing the AI. You can't just delete a matrix and expect it to work.
[06:07]
B
It would be a lobotomy unless you replace it with something smarter. STEM replaces that calculation with a simple lookup table.
[06:13]
A
A lookup table, like an Excel spreadsheet?
[06:16]
B
Basically, yes. It is token indexed. The model has a pre-made list. It knows that if the input token ID is 4021, which is the code for Spain, the information it needs is located at this specific vector.
[06:30]
A
Okay, I think I get it. The old way is like me giving you a coordinate, latitude and longitude. You have to do trigonometry to figure out it points to your house.
[06:36]
B
Right? You have to do the math every time you want to go home.
[06:39]
A
The STEM way is I just have home saved in my project. I click it and I'm there. No math required.
[06:46]
B
That is a fantastic analogy. It's the difference between computing a location and retrieving a location. Because it's tied to the token ID, it is static. There is no router guessing where to go. The model knows exactly where the Spain information lives before it even starts the layer.
[07:01]
A
So the middle manager is fired, the router is gone.
[07:04]
B
The router is gone, the traffic jam is gone.
[07:06]
A
But why does this actually work? Neural networks are supposed to be fluid, right? They learn connections. Hard coding Spain to a specific vector sounds rigid. Are we losing the nuance?
[07:17]
B
This brings us to the key-value theory of memory. Researchers have suspected for a while that FFNs act like key-value pairs. The up projection creates the key, the address, and the down projection holds the value, the fact. STEM just makes the key permanent. And here's the cool part: the geometry.
[07:36]
A
Oh, boy. Geometry. Keep it simple for me.
[07:39]
B
They analyze the angular spread of these STEM embeddings. Imagine all the concepts the model knows are arrows pointing out from a center. In a standard model, the arrow for Spain and the arrow for Portugal might point in almost the same direction.
[07:54]
A
They overlap, which creates interference. The model may get them mixed up because they look similar mathematically.
[07:59]
B
Exactly. But with STEM, because every word has its own dedicated lookup slot, the researchers found that the arrows are orthogonal.
[08:07]
A
Orthogonal?
[08:07]
B
It means they point in wild, totally different directions. The cosine similarity is almost zero.
[08:12]
A
So Spain is pointing north, Portugal's pointing east. No interference.
[08:17]
B
Precisely. It's a cleaner filing system. The model can retrieve specific facts about Spain without accidentally pulling in facts about Portugal or France.
[08:27]
A
So the filing system is cleaner. But let's go back to that scaling wall we talked about at the start. Does this actually solve the hardware problem, or is it just a neat theoretical trick?
[08:38]
B
It solves it in a way that system architects are going to love. First, by cutting that up projection matrix, you remove about 30% of the parameters in those layers immediately. Roughly a third. You're just carrying less weight.
[08:49]
A
30% is not a rounding error. That's significant.
[08:52]
B
But the real magic is in the memory management. Remember how we said the lookup is static? We know Spain is always Spain, right?
[08:59]
A
It's a fixed address.
[09:00]
B
That means we don't need to keep those massive embedding tables on the GPU. GPU memory, HBM, is the most expensive real estate in the world right now. It is scarce, it is pricey, and it is the main limit on how big models can get.
[09:14]
A
So where do we put the tables?
[09:16]
B
We kick them to the CPU, regular system RAM. It's cheap, it's abundant, and you probably have terabytes of it sitting around.
[09:22]
A
But wait, CPU memory is slow. If the GPU is a Ferrari, the CPU is a minivan. Won't that slow the whole model down if we have to fetch data from the slow lane?
[09:32]
B
It would if we didn't cheat. The paper describes a technique called prefetching. Because the model can see the text coming in, because it sees the word Spain in the prompt, it can send a command to the CPU: hey, I see Spain coming up in 5 milliseconds, go get the vector.
[09:48]
A
Now it's preloading the luggage before the plane even lands.
[09:51]
B
Exactly. By the time the GPU actually needs to do the math, the data has already arrived from the CPU. They call this hiding the latency.
[09:59]
A
Right.
[09:59]
B
You get the capacity of a massive model, but you only need enough GPU memory for the active working set.
[10:05]
A
That is incredibly clever. It decouples the brain size from the GPU limits.
[10:08]
B
And the results back it up. On 1 billion parameter models, they saw a 20 to 25% reduction in FLOPs, the computational steps, while accuracy actually went up by 3 to 4%.
[10:20]
A
Accuracy up, compute down. That's the holy grail.
[10:23]
B
It's rare to see both move in the right direction. Usually efficiency comes at the cost of intelligence. Here, they seem to get both.
[10:29]
A
Okay, efficiency is great, but I promised the listeners brain surgery at the top of the show. And this is where the paper goes from smart engineering to mind bending.
[10:39]
B
This is the interpretability section. And honestly, this is the part that gave me goosebumps. Because STEM assigns a specific physical vector to a specific word, we can, for the first time, point to exactly where a fact lives.
[10:53]
A
In a normal model, knowledge is smeared everywhere, right? You can't point to the Spain neuron.
[10:58]
B
Right. It's holographic. It's everywhere and nowhere. But in STEM, Spain is in row four, slot B.
[11:04]
A
It has an address.
[11:05]
B
It has an address. So the researchers decided to play Dr. Frankenstein. They took a prompt: "Country: Spain. Capital:"
[11:11]
A
Which should produce Madrid.
[11:13]
B
Correct. But before the model could answer, they reached into that lookup table. They took the vector for Spain and physically swapped it with the vector for Germany.
[11:22]
A
To be clear, the text on the screen still reads Spain.
[11:24]
B
The input text is Spain, but the model's internal representation, the thought it triggers, is now Germany. And the model output Berlin. It started describing the Reichstag. It talked about the Brandenburg Gate.
[11:39]
A
That is unbelievable. It's like the model is hallucinating, but it's a controlled hallucination.
[11:44]
B
We planted the thought. And they pushed it even harder. Swapping Spain and Germany is easy. They are both single words. But what if you want to swap USA with Czech Republic? Czech Republic is two tokens.
[11:57]
A
Yeah. You can't fit two pegs into one hole. How does that work?
[12:00]
B
You'd think it would break, but they found that you can take the vector for Czech and the vector for Republic and just average them, just mathematically blend them together, like making a smoothie. A semantic smoothie. They put that blended vector into the USA slot, and the model started talking about Prague.
[12:15]
A
That implies that the meaning of Czech Republic is literally just the average of those two words. That feels almost too simple.
[12:22]
B
It validates a lot of theory about vector spaces. But practically, think about what this means. If a model hallucinates, if it thinks the earth is flat, we don't have to spend $10 million retraining it.
[12:34]
A
We just find the flat earth vector and overwrite it with the round earth vector.
[12:37]
B
It's a patch. A software update for facts.
[12:40]
A
That sounds incredible. But does this simple brain actually handle complex tasks or is it just good at geography quizzes?
[12:48]
B
That was my worry too. But they tested it on knowledge-heavy benchmarks like ARC Challenge and OpenBook QA. These are tests that require reasoning and facts. STEM consistently beat the dense baselines. In math, it showed gains on GSM8K, which is a standard math reasoning benchmark. This suggests that even though we are simplifying the memory access, we aren't hurting the reasoning capabilities.
[13:12]
A
There's also a note about long context. Everyone's obsessed with context windows right now, feeding the model entire books or code bases. How does STEM handle that?
[13:21]
B
This is the needle in a haystack test. You hide a specific fact in a massive amount of text, say 32,000 tokens, and ask the model to find it. As the documents got longer, the performance gap between STEM and the standard models actually widened. STEM got better by comparison.
[13:36]
A
Why is that?
[13:37]
B
Because of that static sparsity we talked about. When a conversation gets long, a standard MOE model might get saturated. The router gets overwhelmed trying to juggle all the experts. But STEM naturally scales its capacity. Every unique word you add activates its own distinct parameters. So the more vocabulary you use, the more brain power the model effectively brings online.
[14:00]
A
It doesn't get tired, it just opens more drawers.
[14:02]
B
Exactly. It scales naturally with the complexity of the input.
[14:06]
A
So let's zoom out. We're looking at a new architecture that replaces complex matrix math with smart lookups. It makes models faster because we can offload memory to the CPU. It makes them more stable because we fire the router. And it makes them editable.
[14:21]
B
It really is a comprehensive rethinking of the subsystem.
[14:24]
A
It's a patch, but it's also, well, it's a little scary, isn't it?
[14:27]
B
Awesome.
[14:28]
A
Well, we spent years complaining that AI is a black box. We don't know how it works. STEM turns the lights on, it makes the box transparent. But once you can see the wiring, you can cut the wiring.
[14:39]
B
That is the double-edged sword. On one hand, we can surgically remove bias. If a model has a harmful association with a word, we can theoretically find that vector and neutralize it.
[14:49]
A
Right?
[14:50]
B
We can fix it without destroying the model's language skills. It allows for safety patches.
[14:55]
A
But the flip side, imagine a bad actor downloading an open source model. They don't need to poison the training data, which is hard.
[15:03]
B
Right?
[15:03]
A
They just need to open the lookup table and swap the vector for safe with toxic.
[15:08]
B
You could make a medical AI that recommends poison, and looking at the code, looking at the inputs, you'd never know until it was too late. The vulnerability is the same as the strength: accessibility.
[15:19]
A
When you make the brain easier to heal, you make it easier to hack.
[15:23]
B
That is the reality we are moving into. Interpretable AI is powerful, but it requires a whole new layer of security we haven't even built yet.
[15:31]
A
Something to think about next time you ask your chatbot for advice. Is it accessing the right vector or did someone swap the file?
[15:40]
B
Let's hope it's the right vector.
[15:41]
A
If you want to see that diagram of the Spain/Germany swap, it really is worth a look, just to visualize the brain surgery. Check the show notes. We've linked the paper there. Thanks for diving deep with us.
[15:50]
B
See you next time.