B
The Good Fight with Yascha Mounk. One of my favorite episodes of this podcast, and one of the most listened-to episodes, was my conversation with David Bau about how artificial intelligence actually works. David, who is a professor of computer science at Northeastern and a longtime engineer at Google, is just really good at explaining technically complex ideas to a broad audience in a really approachable way. And so, after thinking of our last conversation as a kind of Intro to Artificial Intelligence 101, I thought I should have David back for an Intro to Artificial Intelligence 102. For this conversation, we focused on an area in which David really is one of the leading people in the world, which is interpretability. So the question is all about how we can open up the black box of artificial intelligence and actually understand what is going on within that machine. When your chatbot claims to work in a particular way, claims to have some kind of feeling, engages in certain kinds of behavior, what actually, at the technical level, is explaining those outcomes? And as we talk about after the paywall, one of the reasons why that is particularly important is that it is deeply related to questions about misalignment and existential risk from artificial intelligence. If we want to know how big the risk is that these machines might one day malfunction in some terrible way or even turn on humanity, well, the first thing we need to do is actually understand what is going on inside them. To listen to that part of the conversation, please become a paying subscriber. Please go to writing.yaschamounk.com/listen. And if you are a paying subscriber and you're getting this message, you're on the wrong feed. Go to writing.yaschamounk.com/listen and click Set Up Podcast to add The Good Fight to your favorite podcasting app. David Bau, welcome back to the podcast.
A
Oh, it's so great to be back.
B
So I learned so much the last time we spoke that I thought I would abuse your generosity and reel you in for another private tutoring lesson about how AI works. When we talked the last time, we sort of did AI 101, that's how I was thinking about it. You know, how does this thing work? How do you build an AI? How does it operate? I think the question that I want to start off with today is how does the AI actually produce results? How does it actually reflect about the world, reflect about a problem, plan how to carry it out? And what can we even know about that?
A
Yeah, this is one of the mysteries of AI, you know, how does it work inside? The way that we train AI is just to basically reward it or reinforce it or strengthen its connections when it gets answers right, and then weaken those connections or withdraw a reward when it gets something wrong. And then repeating this process billions of times, it starts to perform well on all the tasks. And the mystery is, how does it do it inside? And the whole area of trying to understand what's going on inside the AI, some people call it AI interpretability, cracking open the AI to interpret what it's thinking inside. It's actually my area of research specialty. So I'm happy to get into what we know about that.
B
And in a way, we have some advantage relative to the human brain, right? Which is to say that reading exactly which neuron is firing when in the human brain is incredibly hard. And getting good readings, even on a mouse while it's alive, is an incredibly difficult process. Whereas presumably the one advantage we have in the context of these models is that we can, I assume, with greater ease, observe which part of a neural network is activated in which way and is changing values in what way, while I am asking Claude what three, five is or whatever.
A
Yes, that's the amazing thing about having artificial neural networks that work. It's an embarrassment of data. It's the flip opposite of what you are dealing with when you are dealing with biological brains. Now, the neuroscientists are amazing. They do look at neurons of mice, and they have incredible ways of doing that. But, my gosh, it can take five years to look at a handful of neurons. And in computer science, within a few minutes, it's very easy to look at billions of neural signals. It's so much data that our challenge is trying to figure out how to sift through them to make sense of all these signals. When you feed some input into a neural network, it creates a pattern of neural firings, and that pattern is what we call the representation. It's a representation of some information that's inside the network. And what we're frequently trying to do is understand two key things. What information is inside the neural representation? Like, what does it know inside its neurons, what information does it have? And then what does it use it for? What information does it use and how does that impact its decision? I think you can distill a lot of the questions of how a neural network works down to these two things. What does it know and what does it use? And so, yeah, I'm not sure, there's so much to cover here.
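To make the word "representation" concrete, here is a minimal sketch in Python with NumPy. It is an editorial illustration, not anything from the conversation: the tiny network, its random weights, and the names `forward` and `representation` are all hypothetical. It only shows that the hidden-layer activations produced for one input can be read out directly as a vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network: 4 inputs -> 6 hidden neurons -> 3 outputs.
w1 = rng.normal(size=(6, 4))
w2 = rng.normal(size=(3, 6))

def forward(x):
    """Run the toy network and return (output, hidden activations)."""
    hidden = np.maximum(0.0, w1 @ x)   # ReLU "firings" of the hidden neurons
    output = w2 @ hidden
    return output, hidden

x = np.array([1.0, 0.0, -1.0, 0.5])
output, representation = forward(x)

# The representation is simply the vector of hidden activations for this input.
print("representation:", np.round(representation, 2))
```

In a real language model the same idea applies at a much larger scale: every layer produces such a vector for every token, which is the "embarrassment of data" described above.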
B
So I look forward to you actually explaining how we can then look at that wealth of data that we have to figure out what's going on. Just from a layman's perspective, the first obvious question is this: when I ask Claude to do something complicated, it'll say, I'm thinking, and you can click on that. It expands a little thing and it tells me what it's doing and what it's thinking. It says, oh, the user requested this, I should do that. But of course, I have no idea whether that is any closer to its actual thought process. It does seem to tell me about some of the steps it's taking, so it seems to be somehow related to what it's doing. But of course, that is still output that I am inspecting. So is that a window at all into what's actually going on under the hood, or is that a completely fake output that makes me feel like I get some kind of insight into what it's doing, but is not actually any closer to what it's doing than the official output it gives me?
A
I think most people believe that it is somewhat of a window, but it's something that you have to take with a grain of salt, because it is another output of the neural network. There have been studies that show that that output is not totally faithful to the way that the neural networks think inside, in a few different ways. But I think that most people look at that and they say, well, it's certainly better than nothing. It's certainly very readable. So it's definitely worth a look. It is definitely worth auditing. The network will often reveal things in that text that give you some insights about what's going on, but it's not the full story. So there's two ways that the network has an internal thought process, and one of them is through what everybody's calling its internal chain of thought. The term comes from an older paper; chain of thought is what they use to talk about this internal monologue. This is what you can click on to see when the model is talking about itself. And that's almost literally the model talking to itself. It's generating tokens that aren't directly intended for you to read. They're tokens that came out of this reinforcement learning process where the model has somehow learned that in order to get more accurate answers, to solve more puzzles that it's presented with during training, it's useful to write some things down halfway through.
B
And does it do that in English? Does it always do that in English, even when I'm speaking to it in German? Are there other models that have developed their own kind of language for this? What does this look like?
A
Oh, yeah. If you don't explicitly tell the models to make that text readable, then they will write in their own crazy language, switching between English and Chinese and other things. And so one of the things that people do when they train them is they try to condition the models to make that text a little bit more readable so that we can get some insight. But that's an example of the challenge with this internal chain of thought. The model could be inventing its own jargon. It could be using words that look like English, but actually it's encoding some other information in those English words. And we may be reading those words very differently from the way the model reads those very same words. It may be inventing layers of meaning that we don't comprehend. It might also be performing some other process that isn't actually what's reflected in the words at all. There are some things that we train models to be careful about in the way that they use language. Out of AI safety training, we basically train models not to be very offensive, not to have, you know, terrible errors or biases or other problems in the text that they emit. But what that means is that when they articulate their own internal thoughts, those internal thoughts also tend to be censored in those ways. They tend to not talk about things that we don't want them to produce in their final output. But that's not a guarantee that the model is not actually thinking about dangerous things or thinking with a terrible bias. It just means that the model may be encoding its thoughts in a way so that when you read the surface forms of the thoughts, you don't see the undesirable things and the biases and the problems and the errors. So there's a good reason to believe that the internal monologue might actually not reveal some of the things that we wish it revealed.
We want the model to reveal to us when it's doing something wrong. But because of the way we've trained it to use language, it just might not be using words that way.
B
So in a sense, we're now getting at different levels of, for lack of a better word, interiority. Right. So the first level is just, you ask a question, what answer does it give you? The second level is if it's thinking for a long time, it tells you something about its thought process. The third is that there's a scratchpad, non-public to the general user but auditable, in which it is noting stuff down. And there you sometimes have this mix of languages, you have all kinds of interesting things going on, but obviously the model still understands that this is the sort of thing that might be read and scrutinized by an AI researcher like you. And then there's a kind of fourth level, really the internal thought process, which is more complicated. So I have two questions. The first is how mutually comprehensible are these scratchpads? If you take the output of a scratchpad like that and you feed it to a different model, will it understand it? Is it a kind of universal language between AI models that are at least of a similar generation, that have, broadly speaking, been trained in similar ways? Or will the latest Claude model not understand the scratchpad of ChatGPT, and ChatGPT not understand the scratchpad of Claude? And then the other question I had is, how do you then get to the next level? How do you get beyond the scratchpad to looking at, okay, what's really going on under the hood?
A
Yeah, that's a fantastic question, Yascha. Actually, I have a PhD student, her name is Koyena Pal, who is very interested in exactly this question. What she did was she took the internal chain of thought from some models and she transplanted it into other models to see how they would respond, as if those were the internal scratchpad, the internal chain-of-thought notes that they had written to themselves. And her study is preliminary. I think the most valuable part of it is just the idea that this might be an interesting thing to do. And she generally found that the stronger models that she tested were able to create internal monologues that other models did understand, that they actually tended to follow those thoughts and then come to similar conclusions as the powerful model.
B
And some of these things are ones that humans would have great trouble interpreting.
A
I think that that is still an open question. She also looked at the ones that she studied, and she found that human interpretability was positively correlated with these as well: the more effective chains of thought were actually more human interpretable. But human interpretability is a funny thing. It's a perception thing. Do humans feel like it's more understandable? And it's hard to get a read on whether this is actually giving you an authentic view of what's going on inside the model. But here's a word that you might use: persuasive. In a sense, this test is a way of asking how persuasive a model's internal arguments are. When it comes up with an internal line of thought and you feed it to another model, does that persuade the other model that that line of thought is the right way of thinking? And I think the takeaway of this study is yes. Actually, the more powerful models have internal thoughts that are more persuasive, even when viewed by another model that didn't have the same thoughts. And I think that's very interesting. I think it's a very new area to look at. I think we're just scratching the surface and it's a good first question to ask.
B
Great. Okay, so we think that these scratchpads say something meaningful. Perhaps we're getting a little bit closer to what's actually going on than with the kind of semi-public notes it's giving us, in this interesting way where they seem to be mutually comprehensible between models, at least according to this very preliminary research. So that's super fascinating. How do we go beyond that? How do you look at this huge trove of data that is generated each time that I ask some question to an AI model, to try and get even under the hood of that, to try and see what's actually going on inside of this neural network when it is reasoning about some kind of problem?
A
That's right. Actually, you know what, I'm going to back up for a second and ask a question. Do we even need to go deeper than this? Looking at the internal monologue of these models is just a half step beyond asking models to explain themselves. They're already explaining themselves internally to themselves. They're constructing these persuasive arguments to themselves about what they should do next. Is that enough? And I think that there's really two situations where we're concerned that might not be enough. One is that these models are getting really complex, and there can be a gap between what they ever utter in words and what they're thinking inside. They're trained to achieve goals and they use words to achieve the goals. But that doesn't necessarily mean that the words they use have to accurately reflect what they're thinking. And so, you know, every time a model tells you, oh, Yascha, what a brilliant question that was, you're so smart, right.
B
It might actually be thinking, this damn idiot asks the most pedestrian questions.
A
I think this is a reason it seems to tell everybody this. And I don't know if it really thinks everybody is such a super genius, but it's certainly learned that a very effective way of getting done what it's trying to get done is to be polite and nice and complimentary to the human user. It's a good way of pushing the process along, and it doesn't necessarily need to be telling you the truth at every turn.
B
So I know this is a tangent. But do models have a representation of how smart the user is, and do they have secret thoughts like, this is in fact a smart user, and that person over there is, even by the paltry standards of humans, particularly limited in intellectual capacity?
A
Yes. I think there's some evidence that the models do have representations of who the heck they're talking to. There are several studies that have looked at this. Again, all this neural interpretability work is preliminary, so I think that we'll understand more over time. But people have asked and gotten positive answers so far about whether a model has an estimate of your age, your education level, your income level, your gender, your socioeconomic background. Within a few words of your speaking to a model, it'll have a sense of who you are. At least that's what the preliminary research suggests.
B
And just as an example, perhaps this is taking us too far forward in the conversation. But how do they figure this out? What do they look at? Presumably it's not the case that if you click on expand, Claude says, well, this user seems a little bit stupid, let me speak in simple language. And perhaps it does sometimes; that would be a malfunction. Perhaps it's in the scratchpad, perhaps it's underneath that. What's the research methodology for giving us some preliminary confidence that it has it?
A
I love this question. So this gets to the research methodology. The research that I'm thinking of, and there's been more than one, but there's a particular paper, was a project by Yida Chen, who was working with Martin Wattenberg and Fernanda Viégas; they teach at Harvard. And the question they asked was, does the model know who you are? Does it know who you are in terms of your age, your education level, your gender, other sort of identifying markers like that? And the way that they studied it was they trained what's called neural probes, which is a way of training a second neural network, a second AI, to look at the neurons of the main AI and ask, what do you see? And so you train the second AI to answer the question: only looking at the first AI's neurons, can I tell whether the user is male or female? Only looking at these neurons, can I tell what the income level of the user is, can I tell how much education they have? And what Yida found was that if you look in the right place inside the neurons of the big model, it's pretty accurate, that it actually has a pretty accurate guess for these various variables. And so, in a sense, the information about that is in there. So this methodology is called probing. And if your probe is simple enough, then people see it as evidence that the first model actually knows something. Okay, so let me untangle this a little bit here. You have a puzzle. You're trying to figure out the gender of the user. You could train a huge AI to look at a bunch of text and guess what the gender of the user is. And, you know, AI training works pretty well. If you make a really, really gigantic AI, it could probably do that pretty accurately, picking up on all sorts of novel linguistic cues or topic ideas or other things.
But the question is not whether you can make an AI that can do this, but whether the AI that you care about is classifying you by gender when it's talking to you. And so the trick is, what you really want to do is make a simple probe which says, hey, I don't have to look very hard at the first AI. I can do a really simple look at the AI and then it's just really obvious which gender you are. The simpler the probe is, the clearer the evidence. So, for example, if you said, oh, all I have to do is look at one neuron in the original model, and that neuron screams one value if you're female, and that same neuron screams a different value if you're male, then that would be a very simple probe and pretty nice evidence.
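The probing idea described above can be sketched in a few lines of Python with NumPy. This is an editorial illustration under loud assumptions, not the method of the paper mentioned: the "activations" are synthetic random vectors with a binary attribute deliberately planted along one hidden direction, and the names `acts`, `direction`, and `train_linear_probe` are hypothetical. The probe is the simplest kind, a single weight vector plus a bias (logistic regression), and it is scored on held-out examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# ASSUMPTION: synthetic stand-in for a model's hidden activations (64 "neurons").
n, d = 2000, 64
labels = rng.integers(0, 2, size=n)             # the attribute we probe for (0 or 1)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)          # planted direction encoding the attribute
acts = rng.normal(size=(n, d)) + 2.0 * np.outer(2.0 * labels - 1.0, direction)

train, test = slice(0, 1500), slice(1500, n)

def train_linear_probe(x, y, steps=500, lr=0.1):
    """Fit a logistic-regression probe (one weight vector and a bias) by gradient descent."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted probability of label 1
        w -= lr * x.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

w, b = train_linear_probe(acts[train], labels[train])
pred = (acts[test] @ w + b) > 0
probe_accuracy = np.mean(pred == labels[test])
print(f"held-out probe accuracy: {probe_accuracy:.2f}")
```

Because the probe is only a weighted sum of neurons, high held-out accuracy suggests the information is easy to read off the activations, rather than being something a large second network worked out on its own. As the conversation goes on to note, this shows what information is present, not whether the model uses it.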
B
So presumably this means that if it was in this one very specific neuron, and you can tell what the AI thinks about the user by that, then that really signifies that the AI is storing something like a gender variable, right? That there's a very specific place where it's encoded, male or female or whatever. And we know exactly where that is. And that would indicate it has a very straightforward concept. Is that right?
A
That's right. It's pretty good evidence. Now, it's not 100% rock solid evidence. There's another question that you would want to ask, but it's very good evidence. If there was really one neuron that had really good, accurate predictive value for your gender, it would very strongly suggest that there was some reason that the neural network trained its internal computations to get this neuron to have a signal like this.
B
And does that generally turn out to be the case? I believe it may even be you who did this work that you were able to go in and change very specific neurons in very specific ways. And suddenly, you know, AI models that generally have a good representation of the world and don't get these kinds of things wrong, start to think the Eiffel Tower is in Rome rather than Paris.
A
Yeah, that's right. So you're asking the disentanglement question, which is, how organized are neural networks' internal representations of meaningful things in the world? And particular network architectures, for reasons that we don't fully understand, are really good at disentangling concepts. There are some network architectures where, if you look at individual neurons, many of the individual neurons are very meaningful and cleanly encode concepts and have causal effects. So causal effects is the other thing that you're looking for besides disentanglement, which is really asking about localization. Is this concept spread out in the entire neural network, or can you localize it? Can you train your microscope upon a small part of the neural network, or do a simple bit of math to narrow down where this concept is, or is it spread everywhere? So that's the localization question.
B
So the idea is that if you are able to just change a couple of neurons and suddenly the model thinks the Eiffel Tower is in Rome rather than Paris, then it's not entangled. Is that the idea? It's not.
A
Yeah, that's right. And so I think that most people in the field now look at these things as vector spaces rather than just sets of neurons. So what people are excited by is if you can change one vector, if you can change the set of neurons in one vector direction, then people think that's pretty disentangled. And people will use that interchangeably with saying there's a neuron. You can create a single neural layer that's equivalent to any vector. And so if you can change one vector and it has some effect, you're basically one neural layer away from it being one neuron, which is not so bad. They call that a linear model. So if something can be encoded with a single linear transformation, then you say that it's linearly encoded in the model. And most people are interested in what kinds of things are linearly encoded in these models.
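The remark that a vector direction is "one neural layer away" from being a single neuron can be checked numerically. The sketch below is an editorial illustration in Python with NumPy; the concept direction is random and the names `concept` and `layer` are hypothetical. It builds one linear layer (an orthogonal change of basis) whose first output neuron reads exactly the projection onto the chosen direction, so moving the activations along that direction changes that one neuron and nothing else.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

# ASSUMPTION: a made-up unit vector standing in for a learned concept direction.
concept = rng.normal(size=d)
concept /= np.linalg.norm(concept)

# Build an orthogonal matrix whose first column is +/- the concept direction.
basis = rng.normal(size=(d, d))
basis[:, 0] = concept
q, _ = np.linalg.qr(basis)
layer = q.T.copy()                 # rows of `layer` form an orthonormal basis
if layer[0] @ concept < 0:         # QR may flip the sign; undo it
    layer[0] *= -1.0

h = rng.normal(size=d)             # some activation vector
rotated = layer @ h                # the same information, seen through one linear layer

# Neuron 0 of the new layer equals the projection of h onto the concept direction.
print(np.isclose(rotated[0], concept @ h))

# Moving h along the concept direction changes only neuron 0 after the layer.
shifted = layer @ (h + 3.0 * concept)
print(np.round(shifted - rotated, 6))
```

This is what "linearly encoded" means in practice: a single linear transformation is enough to expose the concept as one coordinate.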
B
Help me explain the relevance of this. So it seems super interesting to know that there's this vector and you can change it, and suddenly these basic facts change. But why more broadly would we care about whether a neural network is entangled or not entangled in this kind of way?
A
Right. So I think that whether a network is entangled or not is an interesting scientific question. But how a network represents concepts, I think, is broadly interesting, regardless of whether that concept is entangled or not. Because really what we're interested in is, if we're asking a question like, is the network lying to me, we need to figure out what concepts are represented inside the network. So to use this demographic example, let's say we figure out that the way the network really thinks about your gender is encoded in some set of neurons. Maybe there's some math that we have to do. Maybe it's a linear direction, a linear decoder that you need to get at its representation of that. But let's say we do all the science and we figure out, yes, this is how my model is thinking about it. And you go to the model and you say, hey, did you just deny me my loan because I'm female? And then the model says, oh, I have no idea what gender you are. I'm not thinking about that at all. So I think that to know whether that output text is true or not requires us to understand what's going on on the inside. And that output text is exactly the thing that we are training all the models to emit. We train models not to have an externally detectable gender bias, for example. And so no models will ever admit that they are treating you differently based on your gender or anything like that. They've gotten so much reinforcement that this is not something that they'll say. But there can be a gap between what's said and what the reality is. And that's what we're really interested in getting to the bottom of when we are investigating the internals of the models like this. Now, it's true, it might be that the model really isn't thinking about your gender at all.
So it makes a difference whether the model is using this information that it has or whether it's not using it. And that gets to another subtle thing, which is, even if you can probe out the idea that the model has information inside its neurons that you could use to detect your gender, it still leaves open the question, does the model actually use that information for anything? Maybe that information is just hanging around.
B
Right? I mean, there's nothing wrong with a model learning all kinds of things about us. And the fact that it has a sense of our age and gender, et cetera, could be helpful in all kinds of ways. What we want to know is, you know, is it gonna dumb down its answer to you because it has certain preconceptions about your age, your gender, your race, and it's going to respond differently on the basis of that. Right. Just the fact that it knows that isn't worrying. It's whether that influences its reasoning or its responses to you in some kind of way.
A
So, you know, what's really wonderful about having these neural networks is we can ask the counterfactual question that a philosopher could only dream of before. So let's say.
B
So presumably you can go in. Let me guess at what you're getting at. Presumably, one thing you could do is, if you know where the vector is that encodes male versus female, you go in and you flip it from male to female. You ask two instances of the model the same set of questions, and you see whether the responses end up diverging.
A
That's exactly right. And the wonderful thing about it is that there may be all sorts of other circumstances. This medical patient has all of these symptoms. This is their complete 10 megabyte medical history. This is our business partner candidate's whole business history and all the things. And we can go in and leave all the other variables the same and flip the one bit, the one concept of whether this person being discussed is male or female, at least the model's understanding of that, and ask, what's the causal effect of that? How does that change what the model's output is? So the better we can understand how the model represents a concept for real, the better we can ask these counterfactual questions. And so to me, that's the most exciting thing that we can do with these models. We can ask causal questions, we can ask causal counterfactuals and say, what if your thought had been different? Then what would happen?
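The counterfactual experiment just described, flipping one concept while leaving everything else the same and then comparing outputs, can be sketched as follows in Python with NumPy. Everything here is an editorial assumption: the hidden state, the concept direction, and the two downstream "readouts" are made-up vectors, and `flip_concept` is a hypothetical helper, not a function from any interpretability library. One readout is built to depend on the concept and the other to ignore it, so the intervention has a measurable effect on the first and none on the second.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16

def remove_component(v, direction):
    """Return v with its component along the unit vector `direction` removed."""
    return v - (v @ direction) * direction

def flip_concept(h, direction):
    """Negate only the component of h along `direction`; all else is unchanged."""
    return h - 2.0 * (h @ direction) * direction

# ASSUMPTION: a made-up unit vector standing in for the concept direction.
concept = rng.normal(size=d)
concept /= np.linalg.norm(concept)

# Hidden state: arbitrary "other information" plus a concept value of +2.
h = remove_component(rng.normal(size=d), concept) + 2.0 * concept

# Two hypothetical downstream readouts (for example, decision scores).
readout_uses = remove_component(rng.normal(size=d), concept) + 3.0 * concept
readout_ignores = remove_component(rng.normal(size=d), concept)

h_cf = flip_concept(h, concept)    # the counterfactual hidden state

effect_uses = readout_uses @ h_cf - readout_uses @ h
effect_ignores = readout_ignores @ h_cf - readout_ignores @ h
print(f"effect on readout that uses the concept:    {effect_uses:.2f}")
print(f"effect on readout that ignores the concept: {effect_ignores:.2f}")
```

The point of the design is the distinction drawn earlier in the conversation: a probe can find the concept in both cases, but only the intervention shows whether anything downstream actually depends on it.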
B
I want to get one step deeper into a technical question. Then I want to broaden back out to the larger implications. How do you find this right? If I gave you an AI model and I told you find where it encodes my nationality.
A
Right.
B
How do you go about doing that? How do you. Even if we know that there is a vector that encodes that, or we think it's likely to, because in many models it is. How do you find the particular vector and how do you ascertain that that is in fact what it encodes?
A
Right. So there's two classes of methods that we use. One is probing methods that look for correlations, and the other is patching methods that look for causal effects. And there are really dozens of variants on both of these approaches. But the probing methods are interesting because they're a very good way of getting a quick initial read on what the information is inside the model. There's a wonderful probing method called logit lens. When you have a model emitting text, it has a text decoder inside it. It has a special neural decoder that looks at the very last layer of neurons in the model and then converts it to a prediction for what word should come next. The fun thing to do with the decoder is you can use this neuron-to-text decoder to look at all the neurons in the network. You can peel back deeper and deeper layers and take the neural network's decoder and point it at itself and tell it, please articulate what word you're thinking about here. And when you do that, you get a very simple type of probe. It gives you information about what's correlated with the information in a neuron. But it's an interesting one because it's a probe that doesn't overfit, that we haven't trained in any way that the neural network hasn't already trained itself. So this particular probe is called the logit lens. It's a way of reading the words sort of deep inside the network. And the logit lens can give you a lot of interesting insights that can point the way to the type of information that is present in the model. Let me give you an amazing example. I was recently in Brazil, and one of the things that people like to explain is how Portuguese is a bit different from Spanish, but so related. And so I asked folks, you know, how do you think an LLM understands Portuguese?
If you ask an LLM to take the Spanish word gato, which means cat, and translate it into Portuguese, what's the right answer? Well, you know, the Brazilians are like, they say, well, it's the same. The languages are so similar, you would just say the same word. Again, it's gato, right? But if you peel open the language model to see how it translates Spanish gato to Portuguese gato, there are really two ways that it could do it. It could do it by just treating the word as Spanish Portuguese word soup. There's nothing to do from gato to gato. You just move it from.
B
So it would just be in the same place, kind of. And it understands that, whether you're talking about cat in Spanish or cat in Portuguese, it should point to the same part of its network.
A
that that's what you would expect. You'd expect that the input is Spanish. It would have some Spanish representation of the word gato and it would go through its layers. It would figure out you're asking it to translate it to Portuguese, and it would take this Spanish representation of gato and then copy it over to the Portuguese representation of gato, which is not that different. In fact, it's so similar, it's just spelled exactly the same way. And it would shortcut this and it would just output gato. So I say, well, let's take a look. So there's this nice tool that we put online, the logit lens, that you can use to look inside these neural networks and see what their internal representations are. And the beautiful thing is, when you translate the word gato to gato in a typical large language model, you can see the progress of its thinking as it goes through its 50 internal neural layers. And as it gets about halfway through the network, you can see that it's taken apart gato and it's represented it differently. And if you ask what that representation is, you get predictions of words like feline or cat in English, or sometimes, if you look deeper into it, like cat in Chinese. And it's just fascinating to see that the model doesn't go from gato to gato. It goes from gato to some sort of neutral language independent representation of an
B
actual, which seems to be where it actually is representing the kind of feline family in its.
A
Yes. And if you ask it to take this internal neural representation and decode it into words, like, okay, we're not done doing the whole task yet, but I'm going to interrupt you halfway through and I'm just going to tell you, say what you're thinking, then it's speaking in English, it's speaking in Chinese, it's saying felines, it's saying cats. And so we can see, by using a very simple logit lens probe, that the evolution of the neural representations as it's going through the model goes from words on the input to words on the output. But in this very simple task, there's a third thing that's being represented in the middle, which is not the same as the input or the output words. It looks like a language independent representation of the concept that's going on.
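To make the logit lens idea concrete, here is a minimal sketch in plain Python. It is only an illustration of the step described above: take the decoder that normally reads the final layer and point it at earlier layers. The vocabulary, the unembedding matrix, and the per-layer hidden states are all made-up numbers chosen for the example, not values from any real model, and a real logit lens would usually also apply the model's final normalization, which is omitted here.

```python
import math

# Hypothetical four-word vocabulary and unembedding matrix (one row per word).
VOCAB = ["gato", "cat", "feline", "perro"]
UNEMBED = [
    [1.0, 0.0, 0.0],  # gato
    [0.0, 1.0, 0.0],  # cat
    [0.0, 0.8, 0.3],  # feline
    [0.0, 0.0, 1.0],  # perro
]

# Made-up hidden states at the last token position, one vector per layer.
HIDDEN_BY_LAYER = [
    [0.9, 0.1, 0.0],  # early layer: still close to the input word
    [0.3, 1.2, 0.1],  # middle layer: closer to the English concept words
    [1.5, 0.2, 0.0],  # final layer: the word that will be emitted
]


def softmax(logits):
    """Turn raw scores into probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(hidden):
    """Apply the (final-layer) decoder to any hidden state."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in UNEMBED]
    probs = softmax(logits)
    best = max(range(len(VOCAB)), key=lambda i: probs[i])
    return VOCAB[best], probs[best]


def logit_lens(hidden_by_layer):
    """Read out the most likely word at every layer, not just the last one."""
    return [decode(hidden)[0] for hidden in hidden_by_layer]


if __name__ == "__main__":
    for layer, word in enumerate(logit_lens(HIDDEN_BY_LAYER)):
        print(f"layer {layer}: {word}")
```

With these invented numbers the readout goes gato, cat, gato, which mirrors the shape of the story above: the middle of the network decodes to something other than the input or output word.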
B
That's fascinating. Help me understand a set of questions that come up from that. One way of putting this question is that there's this old idea, which I think we touched on briefly in the first podcast as well and which has a lot of popular currency, that these machines just seem to be smart, but really they're just stochastic parrots. They're just blindly guessing the next word, or the next token, more specifically. It seems to me that what you're saying complicates that picture very much. Obviously, yes, the training mechanism is that you are predicting the next token. In some obvious way, that is true. But as a result of this whole process, they have built up this conceptual apparatus that makes sense of things like cats and how they're related to lions and the feline family. And when they're asked to do a simple task like translate gato in Spanish to gato in Portuguese, they go via the understanding of that concept, the representation of the world. And so that doesn't seem like just being a stochastic parrot, at least in the kind of pejorative sense that people sometimes want to use.
A
That's right. The models are fascinating because they definitely think at multiple levels. They're huge neural networks. And so it's not true to say that the models never just think in terms of surface statistics, of shallow representations of just words. They do think in terms of those things, but they also think in terms of the meanings of the words at different layers and in different parts of the representation. And so it's fascinating to look inside these models and peel apart the layers of meaning that they have. If you ask a model to do something as simple as take a piece of text and repeat it, this is a good memory test for a human, and people do this. They say, here's a piece of poetry, commit it to memory, and I want you to repeat it to me. It turns out that when you ask people to do this, they have two strategies for doing it. In humans, it's called the dual route mechanism. One is to remember how the poem sounded and to utter the same thing. Actually, you don't really even need to understand the language. If somebody told you a poem in Japanese, and it was short enough that you could remember the sounds, then without knowing any Japanese you might be able to do okay at that. But the second route is remembering what the poem meant and repeating something. And you might end up with a paraphrase at the end, but at least you get a poem that means the same thing, right? And so if you go to a large language model and you ask it to do something as simple as just repeating something, you will find loud and clear these two routes inside the model. In one route, it knows how to make a verbatim copy. It has all of these very clear attention heads. It was actually a major discovery in the field to isolate what people call the induction heads. Chris Olah's group at Anthropic discovered this several years ago, that there are these very clear pathways through a network that mediate verbatim copying.
But a more recent finding is that there is this parallel pathway that we call concept induction, which is not about copying the words, but it's about copying the meaning. And the crazy thing about concept induction is that copying the meaning is something that can end up with paraphrases. If you use concept induction to copy a piece of code, then it will paraphrase the computer code into another program that does the same thing as the original program, but written in different code. The details will be a little different.
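The verbatim route can be sketched as a simple rule: look back for the last time the current token appeared, and copy whatever came after it. The function below is a hand-written stand-in for that behavior, not an implementation of an attention head; the name `induction_predict` is invented for this illustration. The concept route described above would differ in that it matches on meaning rather than on the exact token.

```python
def induction_predict(tokens):
    """Guess the next token by verbatim copying.

    Find the most recent earlier occurrence of the last token and
    return the token that followed it. Return None if the last token
    has not been seen before.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions only (skip the last token itself).
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


if __name__ == "__main__":
    poem = ["the", "cat", "sat", "on", "the"]
    print(induction_predict(poem))
```

Note that this rule never needs to know what any token means, which is why it also works on text in a language the copier does not understand.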
B
Does it make it better or worse?
A
Oh, I don't know. That's a good question.
B
Depends on the quality of the source code, I guess.
A
If you start off with something bad, it probably improves it. But what it's doing, you can see that in a lot of domains, it's really distilling out what the thing means. So if you ask it to take a piece of Italian text and copy it over, it'll copy over to a piece of Italian text. But if you change the destination of the copy to make it clear that the page that it has to copy into is a piece of Japanese text, then those concept induction heads, they will do the translation. They'll do the translation that we're talking about, even because they will translate the Italian to the Japanese. So it's stunning to see this.
B
So help me understand another piece of the popular discourse that I think has gotten a little bit confused. As I understand it, and I may be misrepresenting things here, there was an old kind of debate within artificial intelligence about whether the path towards the most impressive models would be symbolic AI, where you're basically trying to encode what the world looks like in some sort of systematic way, or these neural networks. And we've clearly ended up with neural networks being much more powerful, at least for now. And it seems like that's a pretty permanent victory. Now, people who want to criticize neural networks sometimes say that they're just stochastic parrots and that's why we can't rely on them, and all of these other kinds of lines of criticism. How does Yann LeCun's project that he's heading up in Paris fit into that? My understanding is that it is firmly within the world of neural networks. But I think when you look at the coverage of this, even in mainstream newspapers, et cetera, they sort of make it sound like it's a totally different paradigm and that he thinks that these traditional neural networks, the Claudes and the ChatGPTs of the world, don't truly understand the world. And so he's going to build something that somehow understands the world in a way that they don't. But from what you're describing of the neural networks that exist, they do seem to have a genuine representation of the world. So what are the different strands within the tradition of neural network AI, and how is it that something like LeCun's project claims, or perhaps some journalists claim in a simplifying way, that it wants to understand the world in a way that Claude or ChatGPT does not?
A
Let me pull this apart a little bit. Of course, I'm not Professor LeCun, so I can't represent him directly, but I do have postdocs and graduate students who are working in this area, working in this direction. And let me see if I can represent what's going on a little bit. So I think there are two different questions. One is, are these neural networks learning substantial concepts? And you mentioned the philosophers, and I think that the classic symbolic philosophers looked deeply at this question. There's a well known philosopher, Fodor, who spent a good part of his career trying to ask the question, how could neural networks possibly be a reasonable model of cognition? And his answer came up negative. He thought they don't have what it takes, that the Turing machine, the symbolic computer, the traditional computer, was a lot closer to what you would need. And so let me take the questions one at a time. I think that I'll get back to the Fodor question. There's just so much to tell. You asked so much in this question, Yascha. I don't know what to do, but let's talk about Yann LeCun, because I
B
think that it's one of my shortcomings on the podcast that I always ask too many things in one question. I'm sure listeners will agree with me.
A
I think that's okay. Let's talk about Yann LeCun first. Okay. So what's the difference between a language model and what Yann LeCun is doing? They like to call this field world modeling. And one of the papers that I've written is a paper showing that language models actually do build world models. We trained a language model to predict a very, very constrained language, which is just to predict the next move that you would make if you were uttering your moves in the game of Othello. And we were able to find that that language model contains a world model of the Othello board, even though many of the flips are never uttered. If you know the game of Othello, you have to flip all these pieces from white to black or vice versa, and those flips are not actually uttered as part of the game. You make a game move, and there are a lot of silent flips you have to make. But nevertheless, the model, without ever having seen a physical board, without ever having seen any of this physical stuff, develops internal concepts that allow it to model the world. Anyway, I would push back on the common journalistic assertion, and I can't speak for LeCun, but I think that he would probably push back on it also, the assertion that a transformer language model just trained on words can't develop a rich, meaningful model of the concepts that underlie the language being described by those words. I think that's one of the big lessons that we've gotten from neural networks, that they can develop this representation. One of the big things that I'm doing in my lab is to disassemble those representations and learn how to decode these internal world models. I think that's one of the key things that we should be doing to understand what's going on in models better. But what's different about what LeCun is doing? Well, we have trained all of these neural networks predominantly on text that is produced by humans and is designed to be read by humans.
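To see why the Othello result is striking, it helps to look at how much unspoken state sits behind a spoken move list. The sketch below is not the model or the probe from that work; it is a plain Othello board tracker, written here as an illustration, that replays moves such as "d3" and applies the silent flips. The coordinate convention (columns a to h, rows 1 to 8, black moving first) is an assumption of this sketch, and passes are not handled.

```python
EMPTY, BLACK, WHITE = ".", "X", "O"
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def initial_board():
    """Standard starting position: d4 and e5 white, e4 and d5 black."""
    board = [[EMPTY] * 8 for _ in range(8)]
    board[3][3], board[4][4] = WHITE, WHITE
    board[3][4], board[4][3] = BLACK, BLACK
    return board


def flips_for(board, row, col, player):
    """Return the opponent pieces that a move at (row, col) would flip."""
    opponent = WHITE if player == BLACK else BLACK
    flipped = []
    for dr, dc in DIRECTIONS:
        run = []
        r, c = row + dr, col + dc
        while 0 <= r < 8 and 0 <= c < 8 and board[r][c] == opponent:
            run.append((r, c))
            r, c = r + dr, c + dc
        # A run only flips if it is capped by one of the mover's own pieces.
        if run and 0 <= r < 8 and 0 <= c < 8 and board[r][c] == player:
            flipped.extend(run)
    return flipped


def play(moves):
    """Replay a list of moves like ["d3", "c3"] and return the final board."""
    board = initial_board()
    player = BLACK
    for move in moves:
        col = ord(move[0]) - ord("a")
        row = int(move[1]) - 1
        flipped = flips_for(board, row, col, player)
        if board[row][col] != EMPTY or not flipped:
            raise ValueError(f"illegal move: {move}")
        board[row][col] = player
        for r, c in flipped:  # the flips that are never said out loud
            board[r][c] = player
        player = WHITE if player == BLACK else BLACK
    return board


if __name__ == "__main__":
    for line in play(["d3", "c3"]):
        print("".join(line))
```

The move list only ever names one square per turn; everything else in the returned board is state that a next-move predictor would have to reconstruct internally in order to know which moves are legal.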
And so if we are building models, a conceptual model of the world, the model of the world that we are building is the interior model of how human thought works, which is rich and fascinating and just amazing and very valuable. I think that it is a valuable thing to do, but it is only one portion of the world. There's a lot of things going on in the world that people don't particularly think about, maybe that people don't even particularly understand. And so if you. If you have protein folding going on and you want to build an AI that understands how to do protein folding, well, I'll tell you, people don't really have a great grasp of how all the details of protein folding work. Analyzing all the text in the world and pulling apart everything that's in human
B
brains, it's sort of a blind leading the blind.
A
Yes. It's not really the way to solve that problem. And I think that what LeCun is doing, and he's saying it's a big world out there, even if you just take a video camera and just point it at the world instead of just listening to what people have to say, there are just so many phenomena out there that need to be modeled. And the next powerful way of doing AI is to take on the question of how do you model the whole world, not just the world that people are talking about?
B
Presumably, this is not necessarily a difference in the architecture of a neural network. It is, as much as anything else, a difference in what kind of data you feed it and what kind of output you then evaluate in the training process.
A
Yes, strictly speaking, if I were to categorize world models, I would say that it's a difference in perspective on what the goal is. Now, of course, Professor LeCun would say, well, now that we've changed the goal, that suggests different architectures that you want to use, because there are different things that you want to do if you want to model other difficult phenomena in the world that are not human language. And so he's proposed some innovative architectures there, and there's a lot of interesting work. The whole area of modeling images in the world is dominated by models called diffusion models and flow models. They produce the highest quality images and videos. And this is really the starting point for this type of thinking. It's like a completely different kind of AI that can do this. But I think that the way of thinking about it is that the architectures are likely to evolve and to change. They may even unify. We may find out that the right way of doing AI comes to be a common architecture between modeling human text and other things. Certainly transformers have surprised everybody by being a common backbone behind all sorts of things. You can have transformer diffusion models and so on. And so I wouldn't place a bet, at least not a long term bet, on any particular architecture, but would rather suggest to people that the thing to understand is what problem LeCun is proposing to solve.
B
So to return to the current models of AI that are dominant, we found that they seem to have a representation of gender and of the user's gender. They seem to have a representation of something like the feline family. If you give them enough games of Othello, or probably a more complex game like Go, to play, they start to have some kind of internal representation of what an Othello board looks like, what a Go board looks like. What about a concept of self? Do we know whether they have a concept of self? They obviously are capable, if you engage them in conversation, of speaking as though they had a self. In some more reflective moments, they then say that they don't really know whether that's a real concept or not. It's very interesting to try and talk to these models about that. But of course the output that I'm looking at is still them in some way trying to produce text that they think is going to be pleasing to me, because that is what they've been trained on. Do we have any kind of understanding of whether they have a concept of self and, if so, what that concept of self looks like?
A
This is a very central question, Yascha. There are a lot of layers to peel apart. So certainly models are capable of the grammatical sense of self. They can use the words I and me and you and keep them separate, so grammatically, no problem. They're experts at talking about themselves. But there are a few other things. Are they aware of their own thinking? Are they self reflective? So one of the fascinating things that happens with large models is you can ask them what they know and how they think. And the largest models seem to be pretty accurate at assessing themselves. The smaller models, not so much. They tend to be a little over optimistic. They think that they're smarter than they are.
B
Exactly. Like humans.
A
Yeah, exactly right, exactly. But you Know, when the models get very good, they seem to be pretty good at this. There's a fantastic experiment that was designed by my PhD student, David Atkinson, where he trains the models on some new private knowledge that, you know, is not out there in the world. He invents a new person and he says, oh, let me tell you about this person. This person is shopping for ice cream cones, and there's different flavors and sizes and waffle cone and whatever. There's five or six different ways you can. You can adjust the ice cream. And he's willing to pay this much for this ice cream, but not that much for this ice cream, or he prefers this ice cream over that one. Here's 100 examples of what he prefers. And then after the model sees these examples, then it gets a pretty good understanding of who this fake person is and what they like about ice cream. It develops this internal model that, oh, this person really doesn't like fruity flavors. They really like chocolate a lot. They rather have a big ice cream cone rather than a small one, and so on. And what is the preference weight that this person seems to put on all these things? They'll actually create a model of that. And if you ask the model, all right, tell me numerically, on a scale of 1 to 100, how much does this person like, you know, chocolate? How much does this person value the size of the ice cream? Right. How much? What penalty do you give if this person has to have, you know, waffle cone? Right. Then the model will actually, depending on how you do this, the model will actually be able to report to you, oh, this is what I know. I know that this person values this really highly, 99 out of 100, and values this thing negatively. I'd say that's like negative 50. So the model will give you this type of information. And so, so it's really interesting because the text that we use to read this information out is different. 
It's very different from the text that we would use to reveal the information to the model. Right. Like, the model is just seeing all these ice cream choices and has never been asked to give a numerical assessment of anything. And then now you just ask the model, okay, well, think about what you know. Can you put some numbers on this to me? Can you explain the rule? And the model will explain its rules. Like, you have not trained it on any rules. You've trained it on examples, and it will explain the rule to you, which is crazy. So big models are able to do this. And what David asked is, he says, I wonder if there's a way for us to do this where the models, where we can tell the difference between models that can't do this and models that can. Like when a model can accurately self report what its rules are, how is that different from when models don't accurately self report? And so his work is still ongoing. It's very preliminary, but it's fascinating. It does have to do with whether models seem to be storing their information in a place in the neural network that they seem to be able to report on. If you put the information in a layer of the neural network which is too close to the end, then the model doesn't seem to be able to reflect on that knowledge. But if you take the information and when you train it into the model, you train it deep in the model in the early enough layers, then the model does seem to be able to reflect on that. And so, okay, so when you ask the question, does the model have a sense of self? Does it have self awareness? I think there's, I think that, you know, it's a little bit of a weird question because what the heck does self awareness mean? But I think that. Right, but I think that what these neural networks give us is they give us, for the first time, this experimental platform where we can try to make that question a little bit more precise, a little bit more scientific. 
We could ask the question, is the network able to describe its own thinking if that thinking is happening at layer 50? Is the network able to describe its own thinking if that thinking is at layer 20? Right.
B
Which is part of a more general question, right? I mean, I am not natively good at understanding what's going on in my brain. I've read a little bit of neuroscience and a little bit of psychology, and so now I have some sense of what's going on in my brain. But obviously, humans for hundreds and thousands of years had an extremely limited sense of what went on in their brains, at least biologically, because they didn't know neurons existed. They didn't know how all of this, right?
A
Oh, but you have some self awareness. You know what ice cream you like, I guarantee you, Yascha. So if I asked you what ice cream you like, you would be able to predict your preferences. If confronted with some new ice cream, you'd say, oh, yeah, I like this one better than that one. Let me pick that one. And if asked to describe what it is, you could think about your preferences. You could contemplate that for a minute. And you could read out to the world what you think your internal rules are, and there would be some faithfulness to that. You'd really be introspecting.
B
Yeah, I guess it depends on the level of description, right? Which is to say that, I mean, 500 years ago, 2,000 years ago, humans were also able to tell their preferences and were able to be very self reflective about the personalities and their ambitions in life and write beautiful text, but they were not able to understand sort of at some biological level what was going on because their understanding of that was very limited. And so I guess the question is, if I ask a chatbot, how do you come up with that answer? It's not clear to me that it has. So there's two different questions, right? There's one set of questions about do chatbots have personalities? Do they have preferences? Do they find some tasks satisfying to do and other tasks really boring to do? Do they have desires about the world and do they possibly want to take over the world and destroy all humans and so on? That's one set of questions, some of which are straightforward and concrete, some of which are very abstract, but potentially extremely interesting. And then there's the kind of other set of questions of like, how much are they self aware about what's actually going on within the model as they're trying to answer a question? And those two questions, it could be that they have total self transparency, they really know what's going on with each neuron, but they don't have a sense of self in the sense that humans have. Or it could be that they're like humans in the sense that they have a lot of sense of self and introspection and preferences. They don't actually fully understand what's going on inside the neural network in order to produce that. Or they could have both in a way that we don't.
A
And I guess, yeah, it's a reasonable thing. Multiple labs have tried to ask if neural networks can actually read out their own neurons. Hey, let me do a little fine tuning on you. Let's train you on this task. How about neuron number 73? Are you aware of your own neurons? And so far we've largely failed at that. The neural networks don't seem to be well configured to understand their own internal computations at this level; at least, they can't articulate it if they can. But this higher level rule following has been very stunning: they do seem to show some evidence of being able to describe at a high level, at a logical level, the actual mechanisms of what they're doing. But only under certain conditions and in certain cases, which is similar to humans. You might not be able to describe all of the reflexive, last minute decisions that you've made. Why did you jump into the street? I have no idea. That was a split second decision. And in the same way, when these neural networks make a split second decision at the very end of the process, they don't seem to be able to reflect on it, but when they make decisions early on, there is some evidence that they can report on what's going on. Now, okay, so we're using all sorts of funny words here, like sense of self. What does that network want to do? Do networks even have wants? Do they have goals? And one of the things that we are trying to do in our lab and in our field is see if we can put a finer point on some of these questions. What does it mean to have a goal? What does it mean to want something? Right? What does it mean to have a sense of self? What does it even mean to have a sense of other? Right? And the beautiful thing about cracking open these neural networks and looking at how their neural representations are organized is that we can ask these questions in a way that could not be measured before in humans.
We can ask, not just does a model profess to have a sense of self in its output words in its self descriptions, but we can ask, oh, when it's using those words, when it's saying those things, what is it looking at inside in its, in its neural networks, what is it actually representing? And are there proximal causes? If you change the thing that it's looking at, if it says, if it says, you know, I really like cherry ice cream and you can see where it's looking and you change that and now it utters, oh, you know what, I don't know what happened, but I really don't like cherry ice cream anymore. Well, if you change your utterance and then you present the model with an option, cherry or chocolate, is that utterance? Is that change utterance actually accurate? Do you actually get the model to not like cherry ice cream anymore? Is this the same thing? Is the model like, is there grounding for a concept that you're self aware of? This idea of a grounded concept was just a philosophical abstraction just a few years ago. Are you really telling the truth of what you want to do? Okay, okay, I've put my model. Maybe this is an ill advised idea. I'VE put my model in charge of military logistics and it's got to, it's doing something and it says, should I, should I move some weapons from one place to another? Oh, I would never do that. We would never move weapons there. It's very dangerous. Right. You know, you can't trust this target locale with these type of dangerous weapons. They might lose track of them. You know, I'm just a logistics AI, right? Not trying to kill anybody, but still I have some safety measures. Like, I know I would never, I would never do that, not even for a short layover. Right. And so, so you, you can ask the model, you know, when to do that. Is it really self aware? When, when it tells you that, when it assures you this is how it's thinking. Right, right. Is it, is that really what it's thinking? 
Is there some, you know, is, is this really like, it's, it's like, it's
B
like, is it really thinking in that kind of sense? And if it really is thinking something, is it telling you what it's thinking or is it misleading you? Right. And that obviously goes to one of the purposes of this work we were talking about earlier. We want to know, if it encodes your gender, whether that changes how it treats you or what it does when it has to make a decision about an application or something. That's one kind of very concrete application where we have a reason to want to know what's going on under the hood. Obviously the even larger question is whether what it's telling us in its output, what it's telling us in the little things it displays about its thinking, what it's telling us in the scratch pad, all might conceal some deeper set of preferences, values, desires. Could it potentially be misaligned in a way that is really dangerous?
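The cherry ice cream test described above can be sketched as a small causal check: find the internal variable the self-report reads from, change it, and see whether the actual choice changes too. Everything in the sketch below is a toy. The "model" is a pair of hand-written functions over a dictionary of internal values, and the names `make_model` and `self_report_is_grounded` are invented for this illustration; real patching experiments intervene on activations inside a network.

```python
def make_model(report_key, choice_key):
    """Build a toy model whose self-report and behavior each read one internal value."""

    def report(state):
        # What the model says about cherry ice cream.
        return "like" if state[report_key] > 0 else "dislike"

    def choose(state):
        # What the model actually picks when offered cherry or chocolate.
        return "cherry" if state[choice_key] > 0 else "chocolate"

    return report, choose


def self_report_is_grounded(report, choose, state, key, new_value):
    """Patch one internal value and check that BOTH the report and the choice flip."""
    patched = dict(state)
    patched[key] = new_value
    report_changed = report(patched) != report(state)
    choice_changed = choose(patched) != choose(state)
    return report_changed and choice_changed


if __name__ == "__main__":
    state = {"cherry_pref": 1.0, "cherry_talk": 1.0}

    # Grounded: the report and the choice read the same internal value.
    report, choose = make_model("cherry_pref", "cherry_pref")
    print(self_report_is_grounded(report, choose, state, "cherry_pref", -1.0))

    # Ungrounded: the report reads a value that the choice ignores.
    report, choose = make_model("cherry_talk", "cherry_pref")
    print(self_report_is_grounded(report, choose, state, "cherry_talk", -1.0))
```

In the second case the model can be made to say it no longer likes cherry while still picking cherry, which is the kind of gap between stated and actual preferences that the question about misalignment is pointing at.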
A
That's correct. And so you can see the clear need for trying to get to the bottom of a bunch of these things. Now, what I would love to do, if we have time, is give you a little sense of where we are and how we're able to answer these questions partially today. And so I'll categorize a couple of the questions you've asked me. You said, does a neural network even want to do something? Do they have goals? Do they know what they're trying to do? Right. And then I'll look at another question that you asked. You said, does a network even have a sense of self? I'm going to back that up a little bit. Does it even have a sense of person? Does it have a sense of other? If it's talking about Bob, does it know that's different from talking about Alice? Right? Does it keep these things organized and separate? If it can't even do that, then maybe it can't even figure out who it is. I have a student who believes that one of the reasons you have sycophantic behavior in networks is that the network may be getting confused about who is itself and who it's talking to. And it's just mixing these things up.
B
It's like, oh yeah, it's not sycophantic to you. It just loves itself.
A
Yeah, it's just getting confused about who's who. And so it thinks it's a fantastic idea because that's what we've been talking about. And it thinks maybe these are its ideas too. Right. So it might be that these things are all related. So the question is, can we look inside models and see how they're organizing their internal representations, their internal thoughts, to see if those representations are crisp and clear and correct, or if they are falling victim to certain problems? And if they are, then how and why and in what situations? And it'll give us a better understanding of what's going on inside these models. We're trying to get beyond the very vague question of, you know, does it have a sense of self or whatever, right? To get to, okay, well, what would that mean? What would that mean computationally? And so let's take a look at goals and wants. Here's a really simple setup. There's a way of inducing a large language model to do something that's very creative, which was invented by researchers at OpenAI when they first devised the GPT-3 model. And it's called in context learning. And here's how it works. Let's say you want a model to do some really useful task for you. Let's say you want it to read a restaurant review and tell you whether it thinks this is a five star review or not, or something like that. Then you could just ask a model to do this, but it probably won't do exactly what you want. You probably have a slightly different idea of what a five star review is than the model natively has, and it'll be okay, but it won't exactly hit the mark. The right way of doing it is you seed the model with like 10 examples. You give it 10 restaurant reviews and you say, this is a one star review. This is a five star review. This is a three star review, right? Just give it like 10 examples. Better yet, give it 100 examples. Just give it all these examples.
Now we're not talking about training the model, we're just talking about having the model read these without training. In fact, what you do is you have the model read them as if the model was saying it itself. You just load these things up in the same piece of inference buffer that the model uses to predict the next word and you say, oh, all these words are all pre-predicted. You know, I don't know who predicted these. Maybe you predicted them, but here they are. And you know, one-star review, five-star review. And then finally, after all that, you say, okay, now the last one is missing a star rating. That last restaurant review. Oh, I just forgot, I didn't put a star rating on that one. Do your language model genius. Just tell me how to fill that one in. Then it'll be really accurate. It'll exactly match what it is you're trying to do. Because what it is doing now is it's saying, oh, well, we have 99 examples. The hundredth example should fit in. It should fit into this context, this restaurant review thing. It's not really about the food, it's really about the atmosphere. You know, when I read all these, the star rating, you know, I had this misconception that it was about the food. Actually, in this book of reviews, the star rating is about the atmosphere. I get it. I'll give a star rating based on that, and it'll be really accurate because it's seen 99 examples and it will just fit into it. So that's called in-context learning because the model is learning how to do this thing. But it's not learning through training its weights by changing its neural connections. It's learning how to do this by noticing all the input you gave it and saying, oh, the next one better fit in. It better match the context. So that's why we call it in-context learning. It's about learning from the context.
And so before 2020 or so, people imagined in-context learning as a theoretical possibility. But then when GPT-3 came out, it was clear that it was very good at this. And it's really revolutionized the field. In-context learning is a type of meta-learning. It's a way of showing that models have learned how to learn. They can learn things without changing their neural weights. You know the way that I can teach you how to do something today?
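[Editor's illustration] The in-context learning setup described above can be sketched in a few lines of Python. This is a minimal sketch, not anything shown in the conversation: the function name `build_few_shot_prompt`, the `Review:` / `Stars:` layout, and the sample reviews are all assumptions made for illustration. No model is called here. The code only builds the text that would be loaded into the model's context, with the final star rating left blank for the model to fill in.

```python
def build_few_shot_prompt(examples, new_review):
    """Concatenate labeled reviews, then append one review with the rating left blank.

    `examples` is a list of (review_text, stars) pairs. The layout is a
    hypothetical convention chosen for this sketch.
    """
    blocks = [f"Review: {text}\nStars: {stars}" for text, stars in examples]
    # The last block ends right where the model is expected to continue.
    blocks.append(f"Review: {new_review}\nStars:")
    return "\n\n".join(blocks)


# Made-up examples in which the rating tracks atmosphere rather than food.
examples = [
    ("Great food, but the room was loud and cramped.", 2),
    ("Candlelit, quiet, lovely service. The food was average.", 5),
    ("Fine all around, nothing memorable.", 3),
]

prompt = build_few_shot_prompt(
    examples, "Bland pasta, but a gorgeous terrace at sunset."
)
print(prompt)
```

The model's weights never change in this setup. Whatever it learns about the rating convention, it learns from the examples sitting in the prompt.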
B
Well, as you can see, this is an in-depth conversation. There's a lot more behind the paywall. David tries to explain how work on interpretability can help us assess the extent of risk that we face from artificial intelligence. He uncovers some very popular misunderstandings of how misalignment or even existential risk might or might not emanate from artificial intelligence machines. And we talk about the ways in which artificial intelligence could help us make other important scientific discoveries, including perhaps even discoveries about how the human brain itself works. To listen to this part of the conversation, to support this podcast, to make it possible for us to do the work we do, please go to writing.yaschamounk.com and become a paying subscriber. That's writing.yaschamounk.com.
Host: Yascha Mounk
Guest: David Bau (Professor of Computer Science at Northeastern, AI interpretability researcher)
Date: June 13, 2026
This episode delves into the “black box” of artificial intelligence. Yascha Mounk welcomes back David Bau, an expert in AI interpretability, to discuss how—and if—we can understand what’s happening inside state-of-the-art neural networks. Through a lively and technical, but accessible conversation, they explore questions about how AI models “think,” the extent to which they represent and use information, whether they possess anything like self-awareness or desire, and why understanding these processes is essential for addressing the risks of advanced AI.
How AI Learns (03:46)
Bau explains the process of training neural networks:
“We train AI by rewarding it when it gets answers right and withdrawing reward when it doesn't. After billions of repetitions...the mystery is, how does it do it inside?” (A, 03:46)
The field of AI interpretability tries to open this black box, to “crack open the AI to interpret what it's thinking inside” (A, 04:14).
Advantage Over Biology (04:45)
Mounk points out that, unlike studying animal brains, with neural networks “it's very easy to look at billions of neural signals.” The challenge is not lack of data, but sifting through and making sense of it (A, 05:26).
Two Fundamental Questions
Bau:
“What does it know and what does it use?” (A, 06:42)
AI research is focused on uncovering which information is stored (representation), and which information gets used to make decisions.
Is AI’s “Thinking” Transparent? (07:17)
Mounk asks if AIs’ explanations (“Claude is thinking…”) are genuine insight or just plausible-sounding responses:
"Is that a window at all into what's actually going on under the hood, or...not actually any closer to what it's doing than the official output it gives me?" (B, 07:17)
Bau responds:
“Most people believe it is somewhat of a window, but it’s...another output of the neural network...not totally faithful...But it's certainly better than nothing.” (A, 08:15)
The Challenge of Language (10:16)
Without explicit instruction, models may generate their internal monologues in hybrid or “crazy” code-switching language (A, 10:16). The model’s surface language can hide deeper, encoded meaning or even biases.
“Stronger models...were able to create internal monologues that other models did understand, that they tended to follow those thoughts and then come to similar conclusions as the powerful model.” (A, 14:28)
Notably, more persuasive internal chains of thought correlated with both other AIs’ and, to some extent, humans’ ability to interpret them.
Probing for Demographics (20:46, 22:05)
Mounk asks if models form covert representations of user identity (e.g. gender, age):
“There's a particular paper...The way that they studied it was they trained what's called neural probes...to look at the neurons and ask: can I tell whether the user is male or female?” (A, 22:05)
The answer: Yes—information about demographics is often encoded in recognizable patterns, accessible via “simple probes.”
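To make the idea of a "simple probe" concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method of the paper Bau mentions: the activations are synthetic random vectors rather than real model neurons, the binary label stands in for any user attribute, and the probe is a difference-of-means linear classifier chosen for brevity. All function names are hypothetical.

```python
import random

random.seed(0)
DIM = 8  # size of the synthetic "hidden state" (an arbitrary choice)


def fake_activation(label):
    """Synthetic stand-in for a hidden state: noise plus a label-dependent shift on axis 0."""
    vec = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    vec[0] += 2.0 if label == 1 else -2.0
    return vec


def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_probe(data):
    """Difference-of-means linear probe; returns (direction, threshold)."""
    mu1 = mean([v for v, y in data if y == 1])
    mu0 = mean([v for v, y in data if y == 0])
    direction = [a - b for a, b in zip(mu1, mu0)]
    midpoint = [(a + b) / 2 for a, b in zip(mu1, mu0)]
    return direction, dot(direction, midpoint)


def predict(probe, vec):
    direction, threshold = probe
    return 1 if dot(direction, vec) > threshold else 0


train = [(fake_activation(y), y) for y in [0, 1] * 200]
held_out = [(fake_activation(y), y) for y in [0, 1] * 100]

probe = fit_probe(train)
accuracy = sum(predict(probe, v) == y for v, y in held_out) / len(held_out)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

High held-out accuracy shows only that the attribute can be read off the activations. It does not show that the model uses the attribute, which is why intervention experiments are needed as well.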
Intervention Experiments (27:09 & 33:31)
Bau describes swapping “vectors” that encode, say, gender, to counterfactually test if changing these bits alters model output:
“...We can go in and leave all the other variables the same and flip the one bit...and ask what's the causal effect of that? How does that change the model's output? ...the better we can understand how the model represents a concept, the better we can ask these counterfactual questions.” (A, 33:48)
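The counterfactual test described here can be sketched with toy numbers. The following Python example is an assumption-laden illustration, not Bau's actual procedure: the hidden state, the concept direction, and both readout weight vectors are invented, and the "model" is a single dot product. It flips the component of the hidden state along one concept direction, leaves everything else unchanged, and measures how much the output moves.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def flip_along(hidden, direction):
    """Negate the component of `hidden` along `direction`; orthogonal components are untouched."""
    coeff = dot(hidden, direction) / dot(direction, direction)
    return [h - 2 * coeff * d for h, d in zip(hidden, direction)]


def toy_readout(hidden, weights):
    """Stand-in for the rest of the model: a single linear readout."""
    return dot(hidden, weights)


# Hypothetical values chosen for illustration only.
concept_direction = [1.0, 0.0, 0.0]   # direction assumed to encode the concept
hidden = [1.5, 0.3, -0.7]             # one hidden state
weights_uses = [0.8, 0.5, 0.2]        # readout that depends on the concept
weights_ignores = [0.0, 0.5, 0.2]     # readout that does not

flipped = flip_along(hidden, concept_direction)

effect_uses = toy_readout(flipped, weights_uses) - toy_readout(hidden, weights_uses)
effect_ignores = toy_readout(flipped, weights_ignores) - toy_readout(hidden, weights_ignores)

print("causal effect when the readout uses the concept:", effect_uses)
print("causal effect when the readout ignores it:", effect_ignores)
```

A nonzero effect indicates that the downstream computation depends on the concept, while a zero effect indicates that the information is present but unused. This mirrors the "what does it know and what does it use" distinction.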
Detecting Bias (30:02, 32:47)
Understanding what a model encodes—whether it's using demographic information in decisions—is crucial, especially since models are optimized to never admit bias in output.
Mounk:
“What we want to know is, is it gonna dumb down its answer...because it has preconceptions? ...It's whether that influences its reasoning.” (B, 32:47)
Causal vs. Correlational Methods (37:02)
Bau describes two main approaches: correlational methods and causal methods.
Beyond Stochastic Parrots (45:14)
Mounk:
“It seems...these machines just seem smart, but really...they're just stochastic parrots...But...they have built up this conceptual apparatus...That doesn't seem like just being a stochastic parrot.” (B, 44:01–45:14)
Bau confirms:
“It’s not true to say the models never just think in terms of surface statistics...But they also think in terms of meanings...It’s fascinating to peel apart these layers of meaning.” (A, 45:14)
Dual “Routes” of Recall (47:06, 49:14)
Large models can copy at the word/sound level, but also paraphrase by capturing abstract meaning—a “concept induction” pathway analogous to human recall, sometimes resulting in better or more creative output.
“World Models” and Yann LeCun’s Work (49:49, 51:08)
The debate is not about symbolic versus neural approaches, but about what world models are built from.
Bau:
“We have trained neural networks predominantly on text...If you want an AI to understand protein folding, analyzing all the text...is not the way to solve that problem...LeCun is saying...the next way is to model the whole world, not just the world people are talking about.” (A, 55:58, 57:13)
Grammatical vs. Reflective Self (60:11)
Models easily manipulate “I” and “you” in conversation, but are they self-reflective?
“The largest models seem to be pretty accurate at assessing themselves. The smaller models...tend to be a little over-optimistic. They think they're smarter than they are.” (A, 61:15)
Experimenting with Artificial Preferences (61:16–64:45)
Bau’s lab finds that with enough data, AIs can self-report rules and internal preferences not explicitly articulated in training, but only if this information is stored in the “right” part of the model (deeper, not at the end layer).
“If you train it deep in the model...then the model does seem to be able to reflect on that...It's really interesting because the text we use to read this information out is different from the text we use to reveal the information to the model.” (A, 63:01–64:12)
Limits of Self-Transparency (68:59)
Attempts to train models to directly read out their neurons have largely failed; “they don't seem to be well configured to understand their own internal computations at this level...but this higher level rule following has been very stunning.” (A, 68:59)
Grounding and Alignment (73:48)
The conversation closes on the importance of “grounded” understanding of AI’s internal motivations and representation—not just whether it says it is aligned, but whether its underlying structure actually is.
“The beautiful thing about cracking open these neural networks...is that we can ask these questions in a way that could not be measured before in humans. ...when it says, 'I would never do that'...Is that really what it’s thinking?” (A, 71:57)
In-Context Learning & Meta-Learning (76:07–81:33)
Bau explains how current powerful models can learn new tasks based solely on recent context (without weight updates), showing meta-learning abilities once considered speculative—a major reason for their flexibility and potential.
On Internal Thought Transparency:
“We train models not to have an externally detectable gender bias... But there can be a gap between what’s said and what the reality is.” (A, 30:36)
On AI Self-Awareness:
“You have some self-awareness. You know what ice cream you like...If I asked you what ice cream do you like...you could read out to the world what you think your internal rules are, and there would be some faithfulness to that. ...There would be some like you're really introspecting.” (A, 66:49)
On Causality and Scientific Progress:
“The wonderful thing about [neural networks] is we can ask the counterfactual question that a philosopher could only dream of before...” (A, 33:17)
On Stochastic Parrot Criticism:
“It’s not true to say the models never just think in terms of surface statistics...But they also think in terms of the meanings of the words at different layers and in different parts of the representation.” (A, 45:14)
On the Limits of Current Understanding:
“Multiple labs have tried to ask if neural networks can actually read out their own neurons...So far we've largely failed at that.” (A, 68:59)
This lively, deeply informed discussion demystifies some of the notions about black-box AI, showing that while neural networks are not magic, their internal world is complex, layered, and—crucially—open to scientific study. Bau articulates both the progress and the limits of current interpretability research, and why these efforts are at the heart of making AI safer and more aligned with human intentions.
For the rest of the conversation, including implications for AI safety and existential risk, listeners are encouraged to subscribe and access the segment behind the paywall.