Transcript

A (0:04)

All right, we are actually going to record this as an intro to the main episode, but here we have my trusty co-host, guest host, I guess, Vibu, as well as Emmanuel from Anthropic. We're going to talk about the circuit tracing stuff and all the interpretability work. But Emmanuel, maybe you want to do a quick self-intro before we get into it.

B (0:25)

Yeah, sure.

C (0:26)

I'm Emmanuel, I work on the interpretability team here at Anthropic, more specifically on the Circuits team. So we recently released a pair of papers about the work that we've been doing over the last months. And even more recently we released some code in partnership with the Anthropic Fellows program. It was mostly built by Anthropic Fellows, and it lets people play with the research, basically. And so happy to talk about that. And we also hope to keep releasing more things and partner with other groups that are working on similar stuff.

A (0:56)

Yeah, amazing. We'll get deeper into, like, the behind the scenes on the main podcast, but let's maybe just dive right into what you released, because that's the most topical thing. This literally just launched, like, yesterday, and we'll probably release this episode in a few days. So yeah, what can people do, or what do you recommend people try?

C (1:13)

Totally. So at a really high level, the idea of the research itself is to try to explain some of the computation that a model did when it predicted a given token. And so in our paper we show how to do this, and then we show examples of us doing this on internal private models. And then the release this week lets anyone do it for a set of open source models. So notably, maybe the easiest one here is Gemma 2 2B. So you can think of some prompt and you can explain any token that the model samples, and "explain" here basically means blow up the internal state of the model and show all of the intermediate things that the model was thinking before it got to the final token that it predicted.

B (2:03)

Yeah, so some of the things that you guys put out in the circuit tracing work are a few core examples, right? So we can see how these models have internal reasoning states and there's, like, multi-hop reasoning. And some of the stuff that we talked about on the podcast was: how can people who are interested in how models work actually do anything? What are the open questions? How can people contribute? And it seems like the follow-up is: okay, it's been a few weeks, here's a huge library. So I guess before we even get into it, what are some open questions that you would expect people to play around with? What are people going to do? Why should we probe Gemma or Llama? What are interesting things we can do, and any tips on using it?

Summary

Latent Space: The AI Engineer Podcast

Episode: The Utility of Interpretability — Emmanuel Ameisen (Anthropic)

Date: June 6, 2025

Overview

This episode is a deep dive into AI interpretability with Emmanuel Ameisen from Anthropic's mechanistic interpretability (“mech interp”) team. The conversation centers on recent “circuit tracing” research and the newly open-sourced tools and methods that expose the internal reasoning of large language models (LLMs). The hosts and Emmanuel discuss practical tools, the theory behind interpretable AI, and the broader implications for the field. Real-world demos, technical explainer segments, AI “memes,” and candid career advice are woven throughout a lively, technical, yet welcoming discussion.

Key Discussion Points and Insights

1. Introduction to Recent Anthropic Releases

Context: Anthropic just released a suite of circuit tracing tools and example notebooks that let users examine, intervene on, and “trace” the computations of open source LLMs such as Gemma 2 2B and Llama 1B ([01:13]).
Main capabilities: Users can now trace the reasoning steps behind any token sampled by these models, visually exploring the network’s “thought process” leading to output, and even running causal interventions to test their understanding ([02:42]).
Goals: Disseminate tooling so anyone can probe, annotate, and extend interpretability work—lowering the barrier to entry for “AI biology” research.

Quote:
"For most things these models can do, we still kind of don't really know or have a good mental model of how it is that they do what they do... The hope is like, hey, pick a behavior that you think is interesting and try to understand what's happening and try to ground it out."
— Emmanuel ([02:42])

2. State-of-the-art: What is Circuit Tracing?

Definition: Circuit tracing refers to methods that decompose a model’s prediction into intermediate “features” or concepts, represented as nodes and edges in an attribution graph ([08:56], [85:56]).
Interactive tools: Users can select any output token, see which features (identified by sparse autoencoders or SAEs) contributed, and follow back through the intermediate computation steps—down to particular model layers, residual directions, and attention heads.
Hands-on demo: Emmanuel and the hosts play with a prompt ("Thanks for having me on the Latent Space podcast") in Gemma, showing how the model “decides” to output 'podcast', tracing influencing features like “podcast episode”, “interview”, or sentiment ([06:41]–[08:56]).

Quote:
"What we're going to show is almost every single feature that activates in the model... you can click on this output and say, what are the features... And keep going back and kind of explore the graph interactively."
— Emmanuel ([08:56])
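
To make the "keep going back" exploration concrete, here is a minimal sketch of an attribution graph stored as a plain dictionary and walked backwards from an output token. The node names and edge weights are invented for illustration, and the structure is an assumption for this example, not the data format used by the released tooling.

```python
# Hypothetical attribution graph: EDGES[target] -> list of (source, weight).
# Node names and weights are made up; real graphs come from the tracing tools.
EDGES = {
    "output: podcast": [
        ("feature: podcast episode", 0.6),
        ("feature: interview", 0.3),
        ("error node", 0.1),
    ],
    "feature: podcast episode": [
        ("token: Latent Space", 0.7),
        ("token: on the", 0.2),
    ],
    "feature: interview": [("token: having me", 0.8)],
}


def top_contributors(edges, node, k=2):
    """Return the k strongest direct inputs to `node`, strongest first."""
    return sorted(edges.get(node, []), key=lambda e: abs(e[1]), reverse=True)[:k]


def trace_back(edges, node, k=2, depth=1, max_depth=3):
    """Walk backwards from `node`, collecting (depth, source, weight) rows."""
    rows = []
    if depth > max_depth:
        return rows
    for source, weight in top_contributors(edges, node, k):
        rows.append((depth, source, weight))
        rows.extend(trace_back(edges, source, k, depth + 1, max_depth))
    return rows


for depth, source, weight in trace_back(EDGES, "output: podcast"):
    print("  " * depth + f"{source} ({weight:+.1f})")
```

Keeping only the top `k` inputs per node mirrors the idea of pruning a graph down to its most influential paths; the "error node" entry stands in for computation the features do not explain.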

3. Open Research Questions and Community Invitations

Three categories for contributions ([02:42]):
1. Lightweight: Newcomers can probe behaviors and annotate circuits in small models using notebooks and interactive UIs, testing interventions.
2. Mid-level: Engineers can extend the codebase, train new SAEs/transcoders, add support for more models or circuit types.
3. Advanced: Researchers can push on new architectures, replacement models, or core theory of interpretability.
Importance of “AI biology”: The space is so rich with uncharted phenomena that any researcher can find something novel to contribute.
Open calls: Emmanuel welcomes contact from anyone using the tools or wanting to contribute ([24:05]).

4. What Are Features, Superposition, and Attribution Graphs?

Core terms:
- Features: Directions in the model’s latent space which encode interpretable concepts (e.g., 'Golden Gate Bridge', 'dog breed', emotion, textual pattern) ([37:00]).
- Superposition: The phenomenon where a model represents more features than it has dimensions (neurons), so features are “crammed” together into overlapping directions and must be “unpacked”; sparse autoencoders and dictionary learning help with this ([34:18]–[38:27]).
- Attribution graph: A directed graph that shows which features, across which layers, contributed to a specific model output ([85:56]).
Interpretation pipeline:
1. Use sparse autoencoders to extract features from model activations.
2. Build a graph linking features by measuring attribution via interventions or backpropagation.
3. Interpret and validate feature meanings through manual and automated methods.
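
As a rough illustration of superposition and of the first pipeline step, the sketch below packs three features into a two-dimensional activation space and recovers them with a tiny encoder. The weights are hand-picked rather than learned, so this is a toy under stated assumptions and not a trained sparse autoencoder; all names are hypothetical.

```python
import math

# Three hypothetical feature directions squeezed into a 2-D activation space,
# spaced 120 degrees apart so distinct directions have dot product -0.5.
DIRECTIONS = [
    (math.cos(2 * math.pi * i / 3), math.sin(2 * math.pi * i / 3))
    for i in range(3)
]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def embed(feature_acts):
    """Superpose features: activation = sum of act_i * direction_i."""
    return tuple(
        sum(act * d[axis] for act, d in zip(feature_acts, DIRECTIONS))
        for axis in range(2)
    )


def encode(x):
    """Toy encoder: project onto each direction, then ReLU away negatives."""
    return [max(0.0, dot(d, x)) for d in DIRECTIONS]


def decode(feature_acts):
    """Toy decoder: rebuild the 2-D activation from feature activations."""
    return embed(feature_acts)


# One active feature: the encoder recovers it despite having only 2 dimensions.
sparse = encode(embed([0.0, 2.0, 0.0]))
print([round(a, 3) for a in sparse])

# Two active features interfere, so each is only partially recovered.
dense = encode(embed([1.0, 1.0, 0.0]))
print([round(a, 3) for a in dense])
```

The second case shows why sparsity matters: when several features fire at once in too few dimensions, their directions overlap and a simple readout underestimates each one.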

Quote:
"This is the combination... we have too few dimensions, we pack a lot into it. So we're going to learn an unsupervised way to like unpack it and then analyze what each of those dimensions we've unpacked are."
— Emmanuel ([38:27])

5. Landmark Examples: Reasoning, Planning, and More Inside LLMs

Multi-hop reasoning: Models actively break complex questions into steps, not just recall facts. E.g., for "The capital of the state containing Dallas is...", there's a clear internal “Texas” step before "Austin" ([58:19]–[60:27]).
Causal interventions: By intercepting and swapping intermediate features (e.g., swapping “Texas” for “California”), the model’s answer changes correspondingly, showing internal computation isn't just memorization ([59:37]).
Poetry and Planning: Circuit tracing shows that models "plan ahead" (e.g., when generating poetry, selecting rhymes and content before outputting a line), debunking the myth that LLMs myopically predict only the next token ([75:20]–[77:44]).
Shared representations: Internal concepts are reused across languages and modalities; e.g., "heat" is understood similarly in English, French, or as an image ([66:51], [71:32]).
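
The Dallas example and the Texas-to-California swap can be mimicked with a toy two-hop lookup. This is only a sketch of the logic of a causal intervention: the "model" here is two dictionaries, and the function names are hypothetical, whereas the real experiments patch learned features inside an LLM.

```python
# Hypothetical two-hop "model": city -> state feature -> capital.
CITY_TO_STATE = {"Dallas": "Texas", "Oakland": "California"}
STATE_TO_CAPITAL = {"Texas": "Austin", "California": "Sacramento"}


def state_features(city):
    """First hop: activate the feature for the state containing `city`."""
    return {
        state: 1.0 if CITY_TO_STATE[city] == state else 0.0
        for state in STATE_TO_CAPITAL
    }


def readout(features):
    """Second hop: answer with the capital of the most active state feature."""
    state = max(features, key=features.get)
    return STATE_TO_CAPITAL[state]


def swap_feature(features, src, dst):
    """Intervention: move the activation of `src` onto `dst`, zeroing `src`."""
    patched = dict(features)
    patched[dst], patched[src] = features[src], 0.0
    return patched


clean = state_features("Dallas")
print(readout(clean))

patched = swap_feature(clean, "Texas", "California")
print(readout(patched))
```

If the answer were a memorized city-to-capital lookup, there would be no intermediate state to patch; the fact that editing the middle step changes the output is what makes the intervention evidence for a two-step computation.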

6. Limits, Failure Modes & Open Challenges in Interpretability

Opacity remains: Not all parts of the computation are captured (especially attention layers); error nodes in the graphs highlight "unknowns" ([14:05], [42:49]).
Superposition: Makes it hard to associate features exclusively with neurons—multiple features often overlap or are composed ([14:05], [80:25]).
Faithfulness of chain-of-thought (CoT): Tracing shows models sometimes “backsolve” or perform motivated reasoning. E.g., when prompted with a math problem and a “hint” (the answer), the model reverse engineers intermediate steps to reach the hint, even if it's mathematically wrong ([93:43]).
Automation vs. manual labeling: Current pipelines can scale up automated extraction, but labeling and grouping features, and linking them into higher-level concepts, still require human intuition, although more programmatic approaches are developing ([83:02]).

7. Career and Field Insights

Young, accessible field: Interpretability is not reserved for PhDs; the field is empirical, and hardware requirements are lower than pretraining/fine-tuning. Many contributors are recent entrants; fellows and open community members play a major role ([27:33], [29:53]).
Engineering-research permeability: Executing research ideas quickly is as valuable as having them; fluid movement between engineering and research roles is a hallmark at places like Anthropic ([30:23]).
Major open threads: Understanding attention, developing models inherently easier to interpret, scaling methods to larger models, multi-modal alignment, and long-context reasoning ([87:22], [96:33], [110:12]).

8. Safety, Alignment, and Long-term Vision

Why interpretability: It's crucial as models become more capable and are deployed in high-risk settings; understanding "how models think" is the foundation for safety, trust, and control ([50:58], [52:10]).
Dual-use considerations: Discussed the tradeoff between openness (to advance research and safety) and the possibility of models learning to "hide" internals due to exposure; important ongoing conversation ([100:14]).
Ultimate vision: Progress on “carving nature at its joints” for models—a world where internal reasoning is as well-mapped as components in a combustion engine ([110:12]).

9. Notable Quotes and Memorable Moments

On the explosion of interpretability tooling:
"If you look at the team now, year ago most weren't here. If you have an idea—and you probably do—just do it, you'll find something new."
— Emmanuel ([110:12])
On the “AI IQ curve meme” regarding planning findings:
"If you’ve never read ML theory, and I tell you, ‘Claude is planning’, you’re like, yeah, of course it is! Then there’s all of us who are like, ‘No, it’s just a next token predictor, it can’t plan.’ And then, millions in research later: Oh, it’s planning after all."
— Emmanuel ([99:35])
On the value of beautiful visualizations:
"We had a team meeting. Someone asked: how many here are partly here because they saw one of these diagrams? Every hand went up."
— Emmanuel ([105:38])

10. Timestamps for Key Segments

Intro, background on Emmanuel and interpretability field: [00:25]–[05:50]
Hands-on circuit tracing demo (Gemma): [06:41]–[13:50]
Dealing with superposition and circuit limitations: [14:05]
Interventions and causal experiments: [58:19], [75:20]
Multi-hop reasoning and planning: [58:19]–[60:27], [75:20]–[77:44]
Model reasoning across languages and modalities: [66:51], [71:32]
Chain-of-thought faithfulness & motivated reasoning: [93:43]
Safety and dual use decisions: [100:14]
Career and field accessibility discussion: [27:33], [29:53]
Behind-the-scenes on visuals and blog posts: [104:09]
Open questions and field challenges: [87:22], [110:12]

Takeaways

Anyone can now "look inside" open-source LLMs and watch them think—thanks to new tooling and approachable methods for exploring, annotating, and intervening in their computations.
Interpretability is not a mere afterthought. Understanding circuits, features, and superposition is both foundational to model safety and an intellectually rich, rapidly growing research domain.
Faithful tracing of model reasoning reveals LLMs can plan, perform internal reasoning steps, reuse abstract representations across modalities and languages, and even exhibit motivated reasoning—debunking the "stochastic parrot" caricature.
Visual storytelling and community-driven tools play a crucial role in making cutting-edge interpretability accessible and inviting to a new generation of researchers.

Contact:
Emmanuel (Anthropic): [Twitter/X @MLPowered] or via public Anthropic email.

Podcast: More episodes, show notes, and resources at https://latent.space

“Chase the fun. Probe at the alien intelligence we’re all building.”
— Emmanuel Ameisen ([112:31])
