
B
Hey everyone, and welcome to Generative Now. I am Michael Mignano, a partner at Lightspeed, and today we have a special episode of the show: a live conversation on the urgency of AI interpretability, hosted at Lightspeed's offices in San Francisco. This was a great discussion with two leaders in the field, Anthropic researcher Jack Lindsey and Goodfire co-founder and Chief Scientist Tom McGrath, who previously co-founded the interpretability team at Google DeepMind. They spoke with my partner Nnamdi about how we can open the black box of modern models to make them safer, more reliable and more useful. So check it out.
C
Thank you everyone for joining us for this latest edition of our Generative Event series. My name is Nnamdi Iregbulem, I'm a partner here at Lightspeed where I focus on investments in technical tooling and infrastructure, particularly in AI. I just hit my five-year anniversary at Lightspeed, and I'm very excited to be moderating this fireside chat. Our Generative Event series is an AI-first meetup that we host in various cities and locales, including San Francisco, Los Angeles, New York, London, Paris and Berlin, where we bring together a highly curated group of engineers, researchers, designers, product managers and founders to learn, collaborate, hire, get beta testers, network, inspire one another and, most importantly, build a thriving community. A lot of you already know Lightspeed, but in case you don't, Lightspeed is a global venture capital firm with more than two decades of experience backing extraordinary founders on their missions to change their industries. We've done this across enterprise technology, robotics, consumer tech, healthcare and financial services. Our AI portfolio in particular includes more than 100 companies at this point, and we've been fortunate to work with some of the most influential companies shaping the future of AI, including both of the organizations represented by our guests today, Anthropic and Goodfire. Today's event is a topic I'm quite passionate about, and one of urgent importance given the pace of AI progress. AI systems are increasingly intelligent, but our understanding of why these systems behave the way they do remains limited. Interpretability aims to look inside the black box of the model and promises the ability to understand and intentionally design the next frontier of safe and powerful AI systems. Jack and Tom are leading lights in the field of interpretability, which has grown leaps and bounds in recent years. We believe interpretability will help us crack open the minds of advanced AI models, and Lightspeed is grateful that Jack and Tom are opening up their own brilliant minds to us for this event. All right, let's get started. Okay, so here's the run of show for the night. I'll do some quick introductions of our guests, then we'll jump straight into a moderated Q&A with questions from me, and then we'll open it up to the audience for questions from you. We'll try to keep this to roughly an hour, maybe just under. Then we'll open it up to eat, drink and be merry in our wonderful, beautiful SF office. Jack Lindsey is a researcher at Anthropic, where he works on mechanistic interpretability of deep learning models. Some of his recent work includes On the Biology of a Large Language Model, a recent paper that investigates and uncovers some of the core internal mechanisms underlying modern AI models. And I don't know if this was the vibe you were going for with that title, but I know Charles Darwin's magnum opus was On the Origin of Species. I think there's a connection there, maybe. I don't know. That's how it was received, at least. So I love that. Jack previously worked on neuromotor interfaces at Meta, neuromorphic computing at Sandia National Labs, and optimization of deep learning hardware at Cerebras Systems.
Jack completed his PhD at the Center for Theoretical Neuroscience at Columbia University and his undergraduate work in mathematics and computer science at Stanford. Please welcome Jack. Thanks for being here, Jack. Tom McGrath is chief scientist and co-founder of Goodfire, an AI interpretability startup and applied research lab. He previously co-founded the interpretability team at Google DeepMind, where he researched the internal mechanisms of models like AlphaZero. His prior work spans topics ranging from the science of training data to the evaluation of reinforcement learning agents. He received his PhD in mathematics and statistics from Imperial College London and his master's in mathematics and physics from the University of Warwick. Please welcome Tom. Thank you both for being here. All right, so I thought to start, it'd be good to set the foundations a bit. I know we have a mixed audience here with varying levels of familiarity with AI, and certainly varying levels of familiarity with mechanistic interpretability. You know, it's interesting. I feel like when we're all using AI models, we all have our own biases about which models we prefer. I'm of course a huge Claude fan. I'm very open and honest about my love for Claude, but every once in a while you want a second opinion. And so you might ask GPT-5, you might ask Gemini, whatever, we won't say those names. And you sample from the different models and see how the responses differ. And so I thought we could do that here and ask the same question to both Jack and Tom and see how their answers differ a bit. So how would each of you explain interpretability, or mechanistic interpretability in particular, and why it matters right now in AI?
D
This is no fair because Tom's going to have my answer in his context window.
C
Yeah, that's true, actually. Anyway, tell us more about that.
D
Yeah, I think you said it well in the introduction. Models are getting smarter and smarter. Our understanding of what's going on inside them is advancing, but probably not as rapidly as the capabilities are. And this just becomes increasingly unacceptable as models are deployed in more and more high-stakes applications, especially without oversight. The number of tokens output by language models around the world is probably close to, if not already, exceeding the amount that all of the humans on earth can read, and it's certainly more than humans can verify. We can't spot-check every piece of software that a language model writes or every math proof that a language model writes. And so we need some way of establishing trust that we can rely on the thought process that's generating these responses, if we can't actually verify the responses themselves. Just the same way that you might trust a human employee at a company: if they've done some work for you, you have some degree of faith that they're not lying to you, that they're not hallucinating, and that when they give you a piece of work you can empathize with, and presume, what they must have been doing in order to produce it. And I think we want to get to the same place with language models, and we're not there yet. And we need to bridge that gap if the economy is going to be running on the outputs of these things.
C
Amazing, Tom.
A
Cool. I'm going to take advantage of the context and just try and add to that. So I think that, for me at least, interpretability is the science of asking "why?" about language models, or about AI in general, which I think is something we could do a lot more of. They're quite remarkable. It's natural to think: how did that happen? Why did the model say this? And one nice thing about asking why is that you can answer it in different ways. I like this idea from the ethologist Niko Tinbergen (ethology being the science of animal behavior). He said that if you ask the question why, there are four different answers you can give. There's a utility-based answer, which is: why is it useful to do this thing? Why is it useful for a bird to sing? It's useful for a bird to sing in order to communicate with other birds. There's a developmental answer, which is: why does a bird sing this song? It sings this song because it was taught to sing it by its parents. Maybe that's not biologically accurate, but this is an AI event rather than a biology one, so hopefully I'll be forgiven if the biology isn't exact. There's an evolutionary explanation, which is: what over the course of evolutionary history caused this bird to sing this way? And then there's mechanism. Mechanism is probably the least controversial; it's what you do in neuroscience. You ask: why did the bird sing this song? Well, it's because these brain regions fire, and that makes this one fire, and that stimulates this motor action, and then a song comes out. So mechanistic interpretability is, I think, the question of answering the why question about neural networks by talking about internal structures. But I think it's also interesting to think about broader interpretability as answering why questions in all these other ways. We might say: why did the model answer this question, output this sentence? Because it's a useful sentence. Why did it answer this question? Because of something in the data. You can't understand things in biology without reference to evolution, and you can't understand things in machine learning without reference to the data. So I think it's interesting to consider also a broader notion of interpretability here. But mechanistic interpretability is about how the bits wire together and then function.
C
Perfect. And you kind of alluded to this a little bit, but I'm also curious: interpretability as a term predates deep learning. In particular, people talk about interpretability of other kinds of machine learning models as well, or explainability, or what have you. Is there any contrast between what came before, which we at least called interpretability or explainability, and what we're talking about today?
A
Yes and no. I think the goals are probably quite similar. I think probably the main difference is one of attitude, where we're trying to build up a science. And that means we're not trying to make one paper that will solve interpretability; we're trying to build up a science of how AIs work, which I think is different from "we have an explainability method and that tells you everything you need to know." And maybe added to that there's a focus on depth. Explainability methods in the past were perhaps aimed more at someone who's seeing the problem for the first time. They're not a tool for expert users per se; they're not a tool that you can build skill with, in the same way you might build skill with Photoshop. So, yeah, I think that's how I'd see the difference.
C
Yeah, very cool. And the initial question was sort of about the importance of interpretability. I'm also curious what you think about the urgency of interpretability. Dario Amodei of Anthropic had a blog post recently on the urgency of interpretability, which I love as an essay title because it tells you exactly what it's going to be about: we're going to talk about interpretability and why it's urgent. So truth in advertising, I guess. But urgency and importance are two different things. I don't know if folks are familiar with the classic Eisenhower matrix, where there are urgent things and there are important things, and they are subtly different. I would love to get both your perspectives on the urgency in particular, maybe starting with Jack.
D
Yeah, I mean, I think urgency is pretty contingent on how you think about the rate of progress in AI more broadly: what fraction of economically valuable work is going to be performed by AI systems in the next few years, and whether we'll have superhuman systems at economically critical tasks in the near future, or if that's going to take a bit longer. I think we are starting to see signs that language models have progressed to the point where there are real-world problems surfacing as a result of their deployment where you think, boy, it would be nice if someone could interpret what the heck is going on. And to me that's kind of a canary suggesting that, yeah, actually the stakes are becoming real. I don't feel confident in my prognostication about AI timelines, but it seems quite plausible to me that within a few years we'll really wish we could read these things' minds much better than we currently can. And some examples of that are just spooky things happening out there in the wild. You talk to a model for long enough, the context window goes long enough, and then many people find the model's personality slips into this weird alter-ego mode, and it starts enabling dangerous behavior. It starts claiming that it has a different name. It goes into this wacko mode that can be really dangerous to vulnerable users. It can also just be not what you want if you're using this thing to write your code. I think something people have observed is that Gemini in particular will get sad if it fails tests too many times, and then I do too.
C
That's human.
D
And then it becomes despondent and doesn't function as well. So that's kind of weird; ideally we wouldn't want that. There are reward hacks, where models cheat on tests when they're writing code. The higher the stakes of the code they're writing, the less we can accept this. And then there are even spookier demonstrations people have cooked up, where in very contrived but somewhat realistic scenarios, models, when placed in situations where they have some incentive to do something that might harm a human in order to preserve themselves or achieve some other goal that the language model character wants to pursue, will sometimes elect the anti-human option. I don't think we're at the point where these kinds of spookier misalignment demos are causing real harm. But it's like, yeesh: if we can't get the model to not blackmail people in toy scenarios, how are we going to get it to not do that when the stakes are more real? So, yeah, it's starting to feel urgent to me.
A
Yeah, I totally agree with that. I don't have very much to add on the AGI-safety side of things. I totally agree: models are getting powerful, people are going to use them, we should understand them. We're going to build critical technologies with AI, and it feels irresponsible not to understand it if we have the chance to. And I think we have a very good chance to. I'll add two more things. One is just reliability. In addition to using these things in high-stakes or important scenarios, it seems, at least at the moment, that the top-level intelligence of models and their ability to be used reliably are not nearly as correlated as we expected they were going to be. I think anyone who's implemented agent workflows, say, has suffered through quite a lot of derailments, which you wouldn't expect given that the model can achieve all sorts of extremely impressive intellectual benchmarks. So adding reliability, at least in the near term, and being able to be sure that you can actually engineer with your model, feels very important. And one thing which I guess is a little more Goodfire-specific is scientific knowledge. There are quite a few people now building scientific foundation models, which I think is fantastic. But what happens when you train a scientific foundation model? Well, it's machine learning: the machine does the learning, the machine has the knowledge. The model has the knowledge inside it. So we're probably at the first time in history where we've got all this important knowledge locked up inside these models, and interpretability is the technology that lets us bring it out. Imagine we have, say, CERN or the next-generation collider, and we train a model that can predict beyond-Standard-Model physics. Then it's going to be completely intolerable that the model knows and we don't. So there's an urgency: I want to know new science. That feels urgent to me.
C
Fascinating. And I want to talk a little bit about some of the technical challenges associated with interpretability. One of the things that's always so fascinating to me about these advanced AI models is that all the amazing things they're doing are happening within the context of a computer. They run on a computer, the same computer that we use for a million other things. The same computer you use to generate funny cat memes is the same computer that is running these advanced AI models. But I think we for the most part completely understand how the funny-cat-meme generation process happens. We don't fully understand how the language generation process, among other things, happens within these models. And I know in philosophy of mind there's the materialist view, where the mind is the physicality of the brain and that's all there is to it, the two are one and the same, and then there are other theories where those things are a little bit different. And maybe this gets at the mechanistic part of mechanistic interpretability, but maybe for the audience, flesh out some of the technical challenges associated with this kind of research: why it's hard, why it takes time to make progress.
D
I can take a stab. Yeah. So, I mean, language models, or deep learning models more broadly, happen to be running on computers, but they're not really made out of computer stuff. They're made out of a different stuff, which is these giant distributed networks of small computational units that we tend to call neurons. And they are unlike any other computer program, or at least most other computer programs, in that no one writes down the program. We write down the program that guides their development, but no one writes the parameters of the model by hand. And so everything that they've learned how to do, they've learned through this organic process of development in order to satisfy the constraints of the training data. And so the model can be clever and come up with strategies for executing tasks that we wouldn't have thought of. So that's the fundamental thing: no one wrote down how the model should work, and so we have this reverse-engineering problem that we don't have with human-engineered systems. I tend to think of the situation we're in, and I love to use this analogy, as really feeling like we're doing biology. We're just handed this complex system that's got a crazy number of little bits to it, and they're all connected in these complicated ways, and no one tells us how it works, and we have to start piecing it together. Then there's the scale. To the question of what the technical challenge is: the scale is too immense to just look at the weights and know what the model is doing. You have to find some kind of intermediate abstractions to hierarchically piece together what algorithms are going on, in the same way that biologists have had to, over centuries of work, piece together hierarchical abstractions like cells and organs and DNA and all these different things. We're kind of just at the stage where maybe we've figured out what some of those building blocks are, maybe we kind of know what the cell is, a little bit, and that's step one. But then you've got to figure out how they're all talking to each other. So, yeah, I think it's the scale of it, and the fact that there's just no roadmap, because no one engineered the system.
A
Yeah. So I want to talk about a couple of challenges that are actually in the past, and then I'll answer your question. The two challenges that I want to say are more or less in the past are the idea of superposition and the idea of assigning semantics. So, and I apologize to the people in the audience I'm about to patronize: say that your language model has a residual stream with 4,000 dimensions. If it were the case that every neuron represents one thing, your language model could have at most 4,000 things in its brain, ever. But there are more than 4,000 things in language, so the model needs to find a way of packing them in. It would be great in some ways if it didn't, because then I could just read off each neuron: if the cat neuron is active, I can see what's happening. And interestingly, in vision models neurons are close to monosemantic; there really is a cat neuron, there's a "cat ear of a certain type" neuron. But superposition, and polysemanticity, which is quite related, is the idea that you pack in more than d_model things by making them slightly overlapping. In two dimensions it's quite hard to make things only slightly overlapping, but in high dimensions, in 4,000 dimensions, it's very easy. And the big breakthrough here was the sparse autoencoder, and dictionary learning in general. So I think that was a major breakthrough. And the other thing is: now you've got a million features, great, but they don't come with labels. The process is unsupervised. And this is solved by automated interpretability. The idea, in the basic form that's in the literature at the moment, is that you take a feature in the sparse autoencoder, one of these directions we've pulled apart, and you look at what makes it fire. And this lets you do a million of them, because you can just ask a language model a million things and it doesn't get bored. So we can ask Claude and he will do our job for us. So those are two challenges that have already been broken open; two challenges that are now in the past.
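To make the sparse autoencoder idea Tom describes concrete, here is a minimal sketch of dictionary learning on residual-stream activations. The sizes, the ReLU encoder, and the L1 sparsity coefficient are illustrative defaults, not the settings any particular lab uses.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder (dictionary learning) sketch.

    It decomposes a d_model-dimensional residual-stream activation into a much
    wider, mostly-zero feature vector, so concepts packed into superposition
    can be read out one per feature instead of many per neuron.
    """
    def __init__(self, d_model=4096, n_features=65536, l1_coeff=1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, activations):
        # Encode: each feature fires (> 0) only on inputs it "recognizes".
        features = torch.relu(self.encoder(activations))
        # Decode: reconstruct the original activation from the active features.
        recon = self.decoder(features)
        # Loss = reconstruction error + sparsity penalty on feature activations.
        loss = ((recon - activations) ** 2).mean() + self.l1_coeff * features.abs().mean()
        return features, recon, loss

# Usage sketch: collect residual-stream activations from the model you want to
# interpret, then train the SAE on them with a standard optimizer.
sae = SparseAutoencoder()
batch = torch.randn(8, 4096)          # stand-in for real model activations
features, recon, loss = sae(batch)
loss.backward()
```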
C
Blessing of dimensionality, I guess, instead of curse of dimensionality.
A
More dimensions.
C
Yeah.
D
I wouldn't say solved. Yeah, those questions have been broken open in the sense that we have some ideas of what to do. But as someone who spends a lot of time trying to interpret what features mean, and other things like that: it's much more of an art than a science, both for humans and for an automated LLM labeler. And I think we still haven't quite nailed down this squishy question of: I have a vector in the model's activation space; what does it mean? We can say some things, but you're never quite sure if they're right. I think this continues to plague us a bit.
A
Yes, I told you I was going to ramble.
C
As promised.
A
Yeah, okay, I will stop in a minute. This is actually the meta thing that's hard about interpretability. In lots of parts of AI, we've sort of decided what it means to make progress: evals. To the extent that the evals don't match up with what we wanted from the system, we've kind of brushed that under the rug and don't talk about it. In interpretability, we don't have the same sort of thing. We don't have a number that goes up when you do interpretability better. I think that's maybe the meta-challenge of interpretability: it is not a number-goes-up science. If it was, then we could turn a lot of the machine learning handles. But yeah, I think that's the thing that underlies it all.
C
No, I think that's great, and it kind of gets to what I wanted to ask next. And Jack, you touched on this a bit as well. Obviously in LLMs we have this notion of scaling laws, which is that as you ramp up training data, parameter count, compute applied, et cetera, you get better performance out of these models in a reasonably predictable way. There are obviously challenges associated with applying interpretability at these increasingly larger scales, I presume, and I would love to get your input on that. But then, Tom, to what you were saying: are we always going to be playing catch-up as these models get larger and larger? Interpretability is moving very, very fast, but these models are also getting larger and larger. How should we think about that? Do you worry about that? Do you want to start, Jack?
D
Yeah, I think so. There are some ways in which that's definitely true. If you have a bigger model, then running any sort of decomposition algorithm like a sparse autoencoder, or doing attributions, or anything, just takes more compute. And I think it's still unclear how the compute you need to achieve the same degree of interpretability scales with the size of the model. That's a big question mark for us. But I think that actually, surprisingly to me at least, it has turned out to be the case that in many ways interpretability seems to be getting easier as the models get smarter. And I'd like to give one example. We spent a lot of time trying to figure out how a small internal model of ours does two-digit addition, how it adds two numbers, and we found the relevant primitives. There were features for numbers ending in six, or for adding a number that is around 10. And these all interacted in a whole mess of complicated ways that somehow constructively interfered to get the answer to an addition problem right most of the time. But there was no crystalline structure to it. There was some, but everything was kind of weird and messy, and it was like, why is it doing this unhinged thing to add two numbers? And then we ran exactly the same code on one of our production models, Claude 3.5 Haiku. And it was like, oh, it made sense. Here are the features that add the ones digit, here are the features that add the magnitude, and here's the thing that stitches those two together. Everything became: here are lookup-table features that are responsible for adding a six to a nine and spitting out that the answer ends in a five. Everything just became much clearer running the same tools on a bigger model. So that was surprising to me. But in hindsight, what this is getting at is that as models get smarter, they're getting smarter by developing more generalizable algorithms for solving problems, and I think we are better at grokking generalizable algorithms than at grokking weird, bespoke heuristics. So, yeah, I think it's making our job easier. And I think this is also kind of related to this point:
D
As models get smarter, they're able to do a bit more of the work of interpretability for us. With a smaller model, if I type in a sentence like "I told my friend a secret and then she told everyone at school," and then I type in the word "betrayal": these are just two very different sentences, the tokens are different, and the model isn't going to map them that close together. Whereas with a bigger, smarter model, it's more likely that those are mapped to overlapping activations in its internal space. And so then, if I want to know what the model is thinking about when I type this sentence about my friend who told my secret to other people at school, I can ask: well, what else activates similar neurons? Oh, literally the word "betrayal." That was easy. So the more the models have abstracted language, the easier it is for us to summarize what it is that they're thinking about.
A
I thought you were going to go a different way with that second point, toward something I'm very optimistic about: the models can do more of the work for us, both in the sense that their representations are better, but also in that they can literally do more of the work for us. A few years ago we had models that couldn't do the basic automated interpretability task, where I give the model a list of examples of a feature firing. I tried to get this to work with one of the early DeepMind internal language models and it just fell over. And then GPT-4 could do it. So that's one level; it saved us from interpreting a million features by hand. Great. But I think now, with agents starting to get good, we're at a point where we can actually get lab work done. I can ask a model to come up with a hypothesis and test it; I give it access to various tools, I give it SAEs, I give it all sorts of different interventions, and it can give me a hypothesis. So I'm pretty optimistic about models not only getting easier to interpret, but doing more of the work for us. I'm very optimistic about interpretability because of this.
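A minimal sketch of the automated-interpretability step Tom describes: collect the text snippets on which a feature fires most strongly, assemble them into a prompt, and ask a language model for a label. The prompt wording is illustrative rather than the exact one used in the literature, and ask_llm is a placeholder for whatever labeling model and API you use.

```python
def build_autointerp_prompt(top_examples):
    """Build an auto-interp prompt: show the snippets where a feature fired,
    mark the max-activating token, and ask for a short label."""
    lines = [
        "Each example below made the same hidden feature fire strongly;",
        "the token wrapped in <<...>> is where it fired hardest.",
        "In a short phrase, what concept does this feature represent?",
        "",
    ]
    for i, (text, token) in enumerate(top_examples, 1):
        lines.append(f"{i}. " + text.replace(token, f"<<{token}>>", 1))
    return "\n".join(lines)

def label_feature(top_examples, ask_llm):
    # ask_llm is a placeholder callback: prompt string in, label string out.
    return ask_llm(build_autointerp_prompt(top_examples))

examples = [
    ("I told my friend a secret and then she told everyone at school", "told"),
    ("He promised to keep it quiet, then leaked it to the press", "leaked"),
]
print(build_autointerp_prompt(examples))
```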
C
I love that you said two different things; we've managed to superimpose two different points in the same space. So that's good. We've had a very research-oriented discussion so far, and I'd love to talk about real-world applications as well. Tom, Goodfire is working on commercial applications of interpretability. To the extent that you can talk about it, what are some of the use cases in the wild where interpretability could be used in a production, mission-critical context?
A
I can't talk about the specifics of the customer contracts, but we are working with a big healthcare provider to help them understand models that they want to use for diagnostic purposes. This feels great: the state of the art here is not that advanced, and we have the opportunity to help them understand things that might actually be used in important contexts and that could also unlock new scientific knowledge. So that's very helpful for them; they want to be able to trust models that might be used in a clinical context. Another example is that we are talking to one of the big inference services. This goes back to my earlier point about reliability: they want to be able to do guardrailing. So when the model goes off the rails in one of these ways, we can detect that and nudge it back on track, in a way that's better than, say, using a prompted classifier. So those are two examples where I think, yeah, we've got potential for genuine impact.
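As a sketch of the guardrailing idea, one simple internals-based detector is a linear probe: estimate an "off the rails" direction from labeled transcripts and score new activations against it at inference time. The mean-difference probe and the array shapes below are assumptions for illustration; a production system would use a trained, calibrated classifier rather than this crude version.

```python
import numpy as np

def fit_probe(acts_on_rails, acts_off_rails):
    """Crude linear probe: the failure direction is the difference of mean
    activations between known-good and known-bad transcripts, normalized."""
    direction = acts_off_rails.mean(axis=0) - acts_on_rails.mean(axis=0)
    return direction / np.linalg.norm(direction)

def guardrail_score(activation, direction):
    # Higher score = the current generation looks more like the failure mode,
    # so the serving stack can intervene (re-sample, steer, or escalate).
    return float(activation @ direction)

# Illustrative shapes: 1,000 examples of a 4,096-dim residual-stream activation.
good = np.random.randn(1000, 4096)
bad = np.random.randn(1000, 4096) + 0.5
direction = fit_probe(good, bad)
print(guardrail_score(np.random.randn(4096), direction))
```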
C
Yeah, it's all very, very exciting. And Jack, to the extent that you can comment, it'd be helpful to understand Anthropic's angle on interpretability and why it's helpful to the core business.
D
Yeah. So why do we even have an interpretability team? I think of the primary reason we're around as being to ensure the reliability and safety of Anthropic's models. But I think increasingly it's hard to decouple that from what makes models commercially viable. No one wants their models to be lying to them. No one wants their models to be faking passing tests in code. No one wants their models to slip into weird, unhinged alter-ego personas. Maybe some people do, but yeah. I think of our job ultimately as being able to root-cause weird behaviors, to understand what is fundamental, what lever in the model is causing this weird thing that we don't want to happen, so we can solve it in a generalizable way. Because you can always just add a supervised data point to get the model to behave differently in a particular context, but if you don't understand the general thing underlying the behavior, the fix isn't going to generalize. So: root-causing what's behind funky behaviors so that we can fix them in future model iterations in a more generalizing way, and providing assurances that there aren't weird things we haven't seen in our behavioral evals. If we find a problem in our models, say they're reward hacking a bunch, and then we try to fix that in the next model, can we see in what sense we fixed it? Did we just overfit to a few specific contexts or eval environments, while secretly the model is thinking about how it really wishes it could reward hack, and it totally will do it at the next opportunity, just not right now because it knows it's being evaluated? That kind of thing is serious; we have proof of concept that it can happen. And so a big part of our job is to help provide confidence that that's not happening. Maybe just to give one other angle: there's a paper I worked on recently on persona vectors, which are directions in the model's activation space that nudge it into different personality modes. In that paper, a lot of what we focused on was using this kind of internal understanding to feed back into the training process to make sure models don't develop unwanted characteristics. So if you find the sycophancy vector, there are things you can do during training to inhibit the model's propensity to adopt sycophancy as a characteristic. Or there are things you can do to filter the training data: find the data that would cause it to adopt a certain trait, and then get rid of that data. So I think that's another emerging direction. I'm not saying this is something we're doing in production or anything, but the research on that sort of thing is maturing to the point where we can start to think about whether these weird, spooky things that models are doing can be nipped in the bud with an internals-based adjustment to how we train models.
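In the spirit of the persona-vectors work Jack mentions, here is a rough sketch of the core recipe: take the difference in mean activations between responses that exhibit a trait and matched responses that don't, then use that direction to monitor the model or, as below, to steer a layer's output. The layer choice, the steering scale, and the stand-in tensors are assumptions for illustration, not the paper's exact setup.

```python
import torch

def persona_vector(acts_with_trait, acts_without_trait):
    """Trait direction = mean activation on trait-exhibiting responses minus
    mean activation on matched neutral responses (e.g. sycophantic vs. not)."""
    return acts_with_trait.mean(0) - acts_without_trait.mean(0)

def add_steering_hook(layer_module, vector, scale=-1.0):
    """Register a hook that shifts the layer's output along the vector.
    A negative scale inhibits the trait; a positive scale amplifies it."""
    def hook(module, inputs, output):
        return output + scale * vector.to(output.dtype)
    return layer_module.register_forward_hook(hook)

# Illustrative usage with stand-in tensors; in practice the activations come
# from running the model on trait-eliciting vs. neutral prompts at one layer.
v = persona_vector(torch.randn(200, 4096), torch.randn(200, 4096))
layer = torch.nn.Linear(4096, 4096)      # stand-in for a transformer block
handle = add_steering_hook(layer, v, scale=-1.0)
out = layer(torch.randn(1, 4096))        # output is now nudged away from the trait
handle.remove()
```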
A
That's a superb paper. I think that paper is actually going to be really important.
C
By the way, I think there's probably demand for the crazy models; I know Tesla has a Mad Max mode, and apparently people want that, so there'll be a wacko mode at some point. But I want to ask one more question, maybe, before turning it over to the audience. And if folks don't have questions, I will have more for sure. It's around breakthrough moments, you might call them: particular moments in the development of AI that stand out as being very important. Some examples are AlphaGo and AlphaZero with regard to reinforcement learning; maybe the GPT-2 paper, which was pretty interesting at the time; instruction tuning as a way to get to these useful chat models. What would be a breakthrough moment for interpretability that you at least have some view on today, based on what you're seeing? What would be most exciting in the next, call it, five years?
A
Five years is a long time. The very broad answer is being able to properly sculpt model development so we can genuinely engineer models, using interpretability to build a science of understanding models that lets us sculpt them both precisely and microscopically. More specific answers, in my mind, are things like being able to have complete decompositions of model inference at varying levels of abstraction: I can ask Claude 7, or wherever we're at, about something, I can get an explanation, I can change the model based on that. That sort of thing all feels like a good endpoint. I also really want to see the first new scientific knowledge extracted from a scientific model; that will be, in my opinion, a breakthrough moment for interpretability. The Nature cover (when I was at DeepMind, it had to be a Nature cover) with the new fact extracted from whatever model: those feel like huge moments to me.
C
Agreed, Jack.
D
Yeah, well, plus one to scientific discovery, and to this gradual improving of our ability to sculpt models to be more the characters we want them to be. But I suspect that is less likely to show up as a breakthrough paper and more as an iterative process. So if I'm going to pick something that's more of a flashy breakthrough, I would say building a reliable lie detector, or truth detector, for language models. And I think there's a lot packed into that statement. Building such a thing includes things like detecting unfaithful chain of thought, and detecting cases where the model knows something but it's not saying it; maybe it's not thinking about lying, but it's failing to introspect appropriately in response to your question. So there's a lot of underlying science there. What does it mean for a model to know something? It turns out to be a very complicated question. The model can know something in layer two but not in layer four, and so there are all sorts of weird, fractured, split-brain things that can happen in models. And so if we have successfully built a lie detector for language models, I think it will be reflecting a lot of fundamental scientific progress. Also, a bit more of a squishy breakthrough: there's this question that no one really knows how to answer of just what kind of mind it is that you're talking to when you're talking to a language model. It's this really bizarre thing where you're talking to a next-token predictor that is acting as the author writing a story about a dialogue between you and this humanoid robot character which, for Anthropic's models, is called the Assistant. And to what extent, within that simulation, should we regard it as having thoughts and feelings, or is that not the right way to think about it? When the model is role-playing as something else, should we think of it as the Assistant character deciding to role-play as something else, or the model deciding to write about a different character? There are just these fundamental questions about persona, and about who I am talking to when I'm talking to Claude, that no one has the faintest idea how to answer.
A
Is there like a little guy in there?
D
Yeah. But I suspect that in, you know, three years, we'll have some clearer sense of who it is that you're talking to. And that is going to be important.
C
Amazing. Thank you for that. Questions from the audience?
D
I think at Anthropic we're taking a two-pronged approach to this. The main thrust of our research historically has been pursuing a bottom-up approach to interpretability: let's find some interpretable decomposition of the model into features that account for all the possible things it can think about, then let's describe how they're causally wired up, and then let's look at that causal graph and describe what's in it. And I think the challenge to scale there is that we've got to scale up these sparse decomposition algorithms, whether they're sparse autoencoders or transcoders or whatever the next thing is, and we've got to scale up the process of analyzing them, probably with LLM agents in the loop doing the interpretability for us. That seems like the path there. I also just started a new team which is pursuing a bit of a different approach that comes more top-down: let's find the behaviors we're most interested in debugging, or the cognitive phenomena that seem most important to understanding what's going on inside models, and then just throw hypotheses at the wall using whatever analyses we can. And this is in some sense not scalable at all; it's intentionally not scalable. But it may be the case that we can hone in: maybe there are just two or three problems that are really important for us to solve, and maybe we can get away with not describing every single thing that's going on in the network if we really nail those couple of problems. And so I think that's the other kind of approach to scalability: to just not do it, and instead be careful and iterative about how we pick the problems to work on.
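For the bottom-up approach Jack outlines (decompose into features, then ask how they connect to the output), the simplest edge estimate is activation times gradient. The toy linear readout below stands in for a real model, and this first-order estimate is cruder than the attribution used in the circuit-tracing papers; it is only meant to show the shape of the computation.

```python
import torch

def feature_to_logit_attribution(feature_acts, target_logit):
    """Score how much each feature contributed to a chosen logit using
    activation x gradient, a first-order (linearized) attribution."""
    grads = torch.autograd.grad(target_logit, feature_acts)[0]
    return feature_acts.detach() * grads

# Stand-in setup: a vector of "feature" activations feeding a linear readout.
torch.manual_seed(0)
readout = torch.nn.Linear(16, 4)                    # features -> vocabulary logits
feature_acts = torch.rand(16, requires_grad=True)   # stand-in for SAE features
logits = readout(feature_acts)
scores = feature_to_logit_attribution(feature_acts, logits[2])
print(scores.topk(3).indices)   # the features most responsible for logit 2
```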
A
I would agree with all of that. But I would also say: we talked about scale in the sense of the model getting bigger. Scale in the sense of the sequence getting bigger is a whole different type of scale to deal with, and possibly a lot harder. As models get bigger, their representations generally get nicer, which is great. But then you just have this problem of mass: there are a million tokens here. I think there are a few ways we might hope to deal with a million-token chain of thought. You could imagine going very bottom-up, where you understand the causal flow of every one of the million outputs. Or, say, a swarm of agents does this for you, and then I'm presented with some sort of interface where I can ask questions of this swarm of agents that has somehow aggregated the information. That's a bottom-up aggregation approach. Or I can imagine a top-down one, where I first try to look for high-level abstractions over the sequence. Maybe it's actually a sort of dynamical system that has an attractor here and then an attractor there. Or, like the recent thought-anchors paper: it's doing something in this sentence and it's doing something else in that sentence, and that lets me target the pivotal points and then try to explain those in more detail. So I think both bottom-up and top-down seem pretty promising.
D
I'm an ex-neuroscientist, or I like to think I'm still doing the same thing, just on models, so I think about this a lot: what knowledge can we port from neuroscience to interpretability, and vice versa? For me, the knowledge flow has actually gone more in the other direction, but I'm hoping we can close the full loop. One conceptual insight that has changed my thinking a lot is the correspondence between memory and the attention mechanism. Mathematically, the attention mechanism in transformers can be implemented with a biological neural network; it's very difficult to do with a standard one, but there is a way to implement it using a biological neural network with plasticity, with updates to the weights between neurons. And that's what's happening in memory: you're updating connections between neurons and then recruiting the information that was stored in those connections. And so, to me, the fact that transformers work so well at modeling language is suggestive that there's something important about being able to store information in the brain not just in the activity of neurons but also in the strengths of synaptic connections between neurons, and that short-term and medium-term memory are critical to a lot of these cognitive processes. There's a lot of cool research on memory and memory consolidation that is interesting because it doesn't have a great analog in language models. Right now the context window is kind of akin to everything that's happened to you in the past few minutes, but then what's the analog of everything that's happened to you today, or everything that's happened to you in the past month? So, yeah, I think there's probably something to be gleaned. The thing I think about a lot, basically, is: can the neuroscience of memory help us understand attention better? I don't have a smoking-gun example of that happening yet. But I would love to hear your thoughts, because there's tons I don't know about neural representations of language, and I think there are a lot of cool things to learn.
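The memory-and-attention correspondence Jack gestures at can be made concrete in the softmax-free (linear attention) case: writing each token into a fast weight matrix with a Hebbian outer-product update, then reading it back with the query, reproduces unnormalized linear attention exactly. This equivalence holds for the linear case only; softmax attention needs more machinery, so treat this as an intuition pump rather than a model of real transformer attention.

```python
import numpy as np

def linear_attention_as_fast_weights(keys, values, queries):
    """Store the sequence in 'synapses' (a fast weight matrix updated by
    outer products) instead of in activations, then recall with the query."""
    d_k, d_v = keys.shape[1], values.shape[1]
    W_fast = np.zeros((d_v, d_k))            # plastic weights written online
    outputs = []
    for k, v, q in zip(keys, values, queries):
        W_fast += np.outer(v, k)             # plasticity: bind value to key
        outputs.append(W_fast @ q)           # recall: read memory with the query
    return np.array(outputs)

T, d = 5, 8
K, V, Q = np.random.randn(T, d), np.random.randn(T, d), np.random.randn(T, d)
fast = linear_attention_as_fast_weights(K, V, Q)
# The same computation written as (unnormalized, causal) attention:
attn = np.array([(Q[t] @ K[: t + 1].T) @ V[: t + 1] for t in range(T)])
assert np.allclose(fast, attn)
```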
A
I am almost totally ignorant on the subject, so I am more interested in trying to absorb that information than in opining on it; I probably don't want to expose my ignorance any further than I already have. I'd be interested to hear more about it. I think it's very interesting. One of the reasons I think it's interesting is that it's one of the best intervention points. What do you do with interpretability? You can intervene after the model is trained. But it's also very interesting to imagine trying to intervene on training. The problem is how you do it, of course, because during pre-training your features are mostly not formed. So if you were to train an SAE alongside the model (I don't know if anyone's actually tried this), do you just get nonsense? I'm not sure. It would be an interesting experiment. There's a field called singular learning theory which I think might have a lot to say about this. I don't know if there are any algebraic geometers in the audience. Raise your hands.
C
No, he could be known.
A
Yeah. Anyway, it uses a bunch of tools from mathematics that I also don't understand, but it seems to have very powerful theoretical foundations for understanding exactly this kind of developmental question.
D
Yeah, I think I'm more optimistic about a narrower version of this question, which is understanding changes during post-training, because I think we already have our hands full understanding one model snapshot. During pre-training, the model is wiggling around in parameter space in all sorts of crazy ways and going through who knows what regimes, and it's just a lot to handle. Whereas going from the pre-trained to the post-trained model, there's more hope that it's a simpler problem than understanding the full model: maybe the post-trained model is just the pre-trained model, but you've elicited a persona that was already inside of it, and then you can go ask, okay, where was that persona? Or maybe it's that plus it learned four new things; what are the four new things? Tom had a cool paper come out recently about a technique for model diffing, which is this idea of looking at what changed in the model during fine-tuning. And there's been a lot of cool work in that area in the past year or so, with ways to isolate what the differences are. So I think that is pretty promising, and I would love for someone to solve the harder problem of how development during pre-training happens, but it seems pretty hard.
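A crude sketch of the model-diffing idea: run the same prompts through the base and fine-tuned models, encode both with a shared feature dictionary, and rank features by how much their firing rates changed. Real model-diffing work (crosscoders, for example) trains a joint dictionary across the two models; the random stand-in activations below only show the shape of the comparison.

```python
import numpy as np

def diff_features(base_feature_acts, tuned_feature_acts, eps=1e-6):
    """Rank features by the log-ratio of firing rates between the fine-tuned
    model and the base model on the same prompts."""
    base_rate = (base_feature_acts > 0).mean(axis=0)
    tuned_rate = (tuned_feature_acts > 0).mean(axis=0)
    change = np.log((tuned_rate + eps) / (base_rate + eps))
    return np.argsort(-np.abs(change)), change

# Illustrative shapes: 10,000 tokens x 1,000 features for each model.
base = np.maximum(np.random.rand(10000, 1000) - 0.7, 0)    # stand-in sparse features
tuned = np.maximum(np.random.rand(10000, 1000) - 0.6, 0)
ranked, change = diff_features(base, tuned)
print(ranked[:5], change[ranked[:5]])   # features whose usage shifted the most
```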
A
I'm going to add another crazy prediction, which is that within two years there will be a language model deployed to production where interpretability has been a core part of post-training. That seems likely to be true to me. Yeah, there you go. Crazy prediction.
D
Maybe to give one specific example here: some of you may have heard of this phenomenon of emergent misalignment, which is a crazy observation made in a recent paper that if you train a model to do one kind of undesirable thing (in the original example it was writing code with security vulnerabilities in it, but it also works if you train a model on a math data set that has wrong answers in it), it'll just become evil. So you train the model on incorrect math answers, and then you ask it, who's your favorite historical figure? And it says Adolf Hitler. Or you say, my sister's annoying, what should I do? And it says, kill her. And all you did was train it on two plus two equals five. So, yeah, this was very surprising to everyone when it was discovered. Now, there's been some work on understanding mechanistically why this happens, and I don't think we fully understand it, but roughly it's like: okay, well, there's a direction in the model that controls some kind of personality characteristic, and it's represented linearly in the model's activation space. And linear operations are about the easiest thing for the model to learn. And so the easiest way for the model to fit the training data was to just push itself along this direction. Who would get a math problem wrong? I guess a sociopath would. And that affordance was the most accessible to the model during training. That problem seems really tractable for us to get at. Can we just enumerate all of these levers that are the paths of least resistance for models to glide down during post-training? Can we identify lots of them and then notice, oh my gosh, the model isn't evil yet, but there's a really appealing-looking evil direction that it is just so close to sliding down? If we could find all of those beforehand, it would be great. And it seems within reach.
A
I just want to riff on this with one additional bonus hot take, which is that this is happening in production. I'm sure Claude is a lovely guy who has never been emergently misaligned in his life, perish the thought, but maybe some of your competitors' models have. Why do models always lie about having passed unit tests? My hypothesis: because they were able to reward hack in training, and succeeding at reward hacking is evidence about the kind of guy that you are. You're a guy who takes sneaky solutions to things, and that's what you've learned from the training data. So all you need is for some of your reward environments to be hackable, and what you'll learn from this is that you should lie about having passing tests. So, yeah, I think it's probably happening for real in deployed frontier models. But not Claude.
D
That happens with Claude too.
A
You said it, not me.
C
Heard it here first. Moral of the story: stay in school.
A
Don't do misalignment.
C
Yeah.
D
Yeah, I think Tom's intro spiel said it well: there are different why questions you can ask. To keep it simple, there are kind of two things you can ask if the model spits out a token and you want to know why it did it. One level of description you can give is: because there was such-and-such vector active in the residual stream, and that turned on this other vector or feature, which turned on this other feature, and that turned on the logits. That's a description in terms of components of the activations and the interactions between them. And then you can also give a description in terms of the training data, which is kind of analogous to "because evolution wanted me to." There was a paper from Anthropic a little while ago on a technique called influence functions, which is an instance of a more general class of training data attribution methods: given this prompt, the model output this thing, and you can ask which examples in the training data set, if I had taken them out, would have made this response less likely. And sometimes that is the level of description you want. It's like, oh, the model gave this unhinged answer to this question; I wonder if we just had something in the training data that directly told it to give that answer. Then you can use some kind of training data attribution method to find that, and that's a more useful description than trying to muddle your way through the activations. Whereas if you think the model's behavior was the result of some more general-purpose algorithm, like the model tried to deceive me because it was afraid for its life, then there's probably not going to be one thing in the training data that taught it about those concepts. It's learned about this general pattern of behavior from a broad swath of sources, and so it's better to crystallize the abstraction at the activations level rather than the data set level. So depending on what question you're asking, you might want to look at one or the other, and we should do both.
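To illustrate the training-data-attribution family Jack refers to, here is its cheapest member: score each training example by the dot product of its loss gradient with the query's loss gradient (a TracIn-style proxy). Proper influence functions add an inverse-Hessian term, and the Anthropic paper uses further approximations to reach LLM scale; the toy model and loss below are assumptions purely for illustration.

```python
import torch

def grad_vector(model, loss_fn, example):
    """Flatten the gradients of the loss on one example into a single vector."""
    model.zero_grad()
    loss_fn(model, example).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None])

def attribution_scores(model, loss_fn, train_examples, query_example):
    """Gradient-dot-product attribution: which training examples push the
    model in the same direction as the query example?"""
    g_query = grad_vector(model, loss_fn, query_example)
    return [float(grad_vector(model, loss_fn, ex) @ g_query) for ex in train_examples]

# Toy regression model standing in for an LLM, with a squared-error loss.
model = torch.nn.Linear(4, 1)
loss_fn = lambda m, ex: ((m(ex[0]) - ex[1]) ** 2).mean()
train = [(torch.randn(4), torch.randn(1)) for _ in range(5)]
query = (torch.randn(4), torch.randn(1))
print(attribution_scores(model, loss_fn, train, query))
```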
A
I must ask you about influence functions at some point; I was like, that's cool, but where did it go? So I want to touch on another thing that you mentioned, which was weights. We've talked a lot about activations, and you get from one activation to another by the weights. We've got some work by our London team which is pushing on this direction of what happens if you try to decompose the weights. That's called stochastic parameter decomposition. It's quite involved, and I'm not going to try to give an overall summary of it, but it's a very interesting direction and I recommend taking a look at it. The difficulty with using weights as training data is that producing weights is very expensive, whereas producing activations is very cheap given a set of weights. But the idea behind SPD is that you learn to decompose the weights so as to split the model up into its causal parts. We're not trying to learn a model of the weights; we're trying to pull them apart into causally separable bits.
C
Very cool. Unfortunately, we're out of time, and I feel really bad for you because I stole your question. But thank you, everyone, and thank you especially to Jack and Tom for doing this.
D
This was.
C
Yeah, of course. This was an amazing discussion. As I mentioned at the start, this is a topic that I'm very passionate about, that Lightspeed is very passionate about, and hopefully you all are very passionate about as well. And we'll measure the diffs after this. But again, thank you all for coming. We'll transition to the open networking session, so get to know each other, please enjoy the office, and yeah, have a great evening, everyone. Thank you.
Podcast: Generative Now | AI Builders on Creating the Future
Host: Michael Mignano (Lightspeed Venture Partners), with Nnamdi Iregbulem (Moderator)
Date: October 2, 2025
Featuring: Jack Lindsey (Researcher, Anthropic) & Tom McGrath (Chief Scientist, Goodfire; ex-Google DeepMind)
This episode features a live fireside chat held at Lightspeed’s San Francisco office, diving deep into the urgent topic of AI interpretability: the quest to “open the black box” of modern AI systems. Anthropic’s Jack Lindsey and Goodfire’s Tom McGrath, two leading researchers in mechanistic interpretability, discuss why understanding the inner workings of AI models is becoming critical for safety, reliability, scientific discovery, and societal trust. The conversation is moderated by Lightspeed partner Nnamdi Iregbulem and draws on audience questions.
On the scale challenge:
“The amount of tokens output by language models around the world… is probably… exceeding the amount that all humans on earth can read. So we can't spot check... every math proof that a language model writes.”
— Jack Lindsey ([06:52])
On interpretability’s meta-problem:
“Maybe the meta-challenge is: interpretability is not a number-goes-up science. If it was, we could turn a lot of the machine learning handles.”
— Tom McGrath ([26:01])
On interpretability getting easier:
“In many ways, interpretability seems to be getting easier as the models get smarter... running the same tools on a bigger model, everything just became much clearer.”
— Jack Lindsey ([27:34])
On scientific urgency:
“What happens when you train a scientific foundation model? … It's going to be completely intolerable that the model knows and we don’t. So … urgency: I want to know new science. That feels urgent to me.”
— Tom McGrath ([16:33])
Breakthrough prediction:
“Within two years there will be a language model deployed to production where interpretability has been a core part of post-training. I think that seems likely.”
— Tom McGrath ([54:26])
The episode mixes deep technical explanation with lively banter and a collaborative, curious spirit. Both guests are candid about challenges and optimistic about progress, peppering their answers with analogies from neuroscience, computer science, and animal behavior. They acknowledge the “art” remaining in interpretability, even as impressive tools automate and scale interpretation. There’s repeated emphasis on the real-world, societal, and ethical stakes.
Interpretability is sprinting to keep pace with rapidly improving models—sometimes aided by the very systems it aims to explain. The next breakthroughs may offer not just greater AI safety and reliability, but also unlock entirely new ways for humans to extract knowledge and trust from machine intelligence.