Transcript
A (0:04)
All right, we are actually going to record this as an intro to the main episode, but here we have my trusty co-host, guest host I guess, Vibhu, as well as Emmanuel from Anthropic. We're going to talk about the circuit tracing stuff and all the interpretability work. But Emmanuel, maybe you want to do a quick self intro before we get into it.
B (0:25)
Yeah, sure.
C (0:26)
I'm Emmanuel, I work on the interpretability team here at Anthropic, more specifically on the circuits side. So we recently released a pair of papers about sort of like the work that we've been doing over the last few months. And even more recently we released some code in partnership with the Anthropic Fellows program. It was mostly built by Anthropic Fellows, and it lets people play with the research basically. And so, happy to talk about that. We also hope to kind of like keep releasing more things and partner with other groups that are working on similar stuff.
A (0:56)
Yeah, amazing. We'll get deeper into the behind-the-scenes on the main podcast, but let's maybe just dive right into what you released, because that's the most topical thing. This like literally just launched yesterday, and we'll probably release this episode in a few days. So yeah, what can people do, or what do you recommend people try?
C (1:13)
Totally. So like at a really high level, you know, the sort of idea of the research itself is to try to explain some of the computation that a model did when it predicted a given token. And so in our paper we kind of like show how to do this, and then we show examples of us doing this on internal private models. And then the release this week lets anyone do it for a set of open source models. So notably, maybe the easiest one here is Gemma 2 2B. You can sort of like think of some prompt, and you can explain any token that the model samples. And "explain" here means basically blow up the internal state of the model and show all of the sort of intermediate things that the model was thinking before it got to the final token that it predicted.
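(Editor's note: to make "blow up the internal state and show intermediate things the model was thinking" concrete, here is a toy sketch of the related "logit lens" idea: project the hidden state after each layer through the unembedding to see which token the model currently favours at that depth. All weights and states below are hand-made toy values for illustration; this is not Gemma 2 2B and not the released circuit-tracing library.)

```python
import numpy as np

# Toy "logit lens": after every layer, project the residual-stream
# hidden state through the unembedding matrix W_U and read off which
# token the model currently favours. Everything here is hand-built.

vocab = ["Paris", "London", "Texas"]

# Toy unembedding: one row per hidden dimension, one column per token.
W_U = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Pretend residual-stream states after each of three layers. Early
# layers lean toward a distractor; later layers converge on an answer.
hidden_states = [
    np.array([0.1, 0.2, 0.9]),  # after layer 1: leaning "Texas"
    np.array([0.6, 0.3, 0.5]),  # after layer 2: shifting to "Paris"
    np.array([1.2, 0.2, 0.1]),  # after layer 3: confident "Paris"
]

def lens(h):
    """Return the vocab token with the highest logit for hidden state h."""
    logits = h @ W_U
    return vocab[int(np.argmax(logits))]

readout = [lens(h) for h in hidden_states]
print(readout)  # ['Texas', 'Paris', 'Paris']
```

The per-layer readout shows the model's intermediate "thoughts" shifting before the final prediction, which is the intuition behind inspecting internal state, though the actual circuit-tracing release builds full attribution graphs over features rather than a simple per-layer projection.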
B (2:03)
Yeah, so some of the things that you guys put out with the circuit tracing include a few core examples, right? So we can see how these models have internal reasoning states, and there's like multi-hop reasoning. And some of the stuff that we talked about on the podcast was: how can people that are interested in how models work kind of do anything, right? So what are open questions? How can people contribute? And it seems like the follow-up is, okay, it's been a few weeks, here's a huge library. So, you know, I guess before we even get into it, what are some open questions that you would expect people to kind of play around with? What are people going to do? Why should we probe Gemma or Llama? What are interesting things we can do, and any tips on using it?
