Latent Space: The AI Engineer Podcast
Episode: The Utility of Interpretability — Emmanuel Ameisen (Anthropic)
Date: June 6, 2025
Overview
This episode features a deep dive into the cutting edge of AI interpretability with Emmanuel Ameisen from Anthropic's mechinterp (mechanistic interpretability) team. The conversation centers on recent landmark research open-sourcing “circuit tracing” tools and methods, which illuminate the internal reasoning of large language models (LLMs). The hosts and Emmanuel discuss practical tools, the theory behind interpretable AI, and the broader implications for the field. Real-world demos, technical explainer segments, AI “memes,” and candid career advice are woven throughout a lively and technical, yet welcoming, discussion.
Key Discussion Points and Insights
1. Introduction to Recent Anthropic Releases
- Context: Anthropic just released a suite of circuit tracing tools and example notebooks allowing users to examine, intervene on, and “trace” the computations of open-source LLMs such as Gemma 2 2B and Llama 1B ([01:13]).
- Main capabilities: Users can now trace the reasoning steps behind any token sampled by these models, visually exploring the network’s “thought process” leading to output, and even running causal interventions to test their understanding ([02:42]).
- Goals: Disseminate tooling so anyone can probe, annotate, and extend interpretability work—lowering the barrier to entry for “AI biology” research.
Quote:
"For most things these models can do, we still kind of don't really know or have a good mental model of how it is that they do what they do... The hope is like, hey, pick a behavior that you think is interesting and try to understand what's happening and try to ground it out."
— Emmanuel ([02:42])
2. State-of-the-art: What is Circuit Tracing?
- Definition: Circuit tracing refers to methods that decompose a model’s prediction into intermediate “features” or concepts, represented as nodes and edges in an attribution graph ([08:56], [85:56]).
- Interactive tools: Users can select any output token, see which features (identified by sparse autoencoders or SAEs) contributed, and follow back through the intermediate computation steps—down to particular model layers, residual directions, and attention heads.
- Hands-on demo: Emmanuel and the hosts play with a prompt ("Thanks for having me on the Latent Space podcast") in Gemma, showing how the model “decides” to output 'podcast', tracing influencing features like “podcast episode”, “interview”, or sentiment ([06:41]–[08:56]).
Quote:
"What we're going to show is almost every single feature that activates in the model... you can click on this output and say, what are the features... And keep going back and kind of explore the graph interactively."
— Emmanuel ([08:56])
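As a mental model of the interactive graph exploration described above, the sketch below walks a toy attribution graph backward from an output token, collecting the features that contributed above a weight threshold. The feature names, edges, and weights are invented for illustration; real graphs come from the released circuit-tracing tooling, not this code.

```python
from collections import deque

# Toy attribution graph: edges[child] lists (parent, weight) pairs,
# meaning "parent feature contributed to child with this weight".
# All names and weights here are made up for the example.
edges = {
    "output:'podcast'": [("feat:podcast-episode", 0.6), ("feat:interview", 0.3)],
    "feat:podcast-episode": [("feat:audio-media", 0.5)],
    "feat:interview": [("feat:audio-media", 0.2), ("feat:conversation", 0.4)],
}

def trace_back(node, min_weight=0.1):
    """Walk attribution edges backward from an output token,
    keeping every contributor above the weight threshold."""
    seen, queue = set(), deque([node])
    while queue:
        current = queue.popleft()
        for parent, weight in edges.get(current, []):
            if weight >= min_weight and parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

contributors = trace_back("output:'podcast'")
print(sorted(contributors))
```

Clicking a node in the real UI corresponds to one step of this backward walk: from a token, to its features, to the features behind those.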
3. Open Research Questions and Community Invitations
- Three categories for contributions ([02:42]):
- Lightweight: Newcomers can probe behaviors and annotate circuits in small models using notebooks and interactive UIs, testing interventions.
- Mid-level: Engineers can extend the codebase, train new SAEs/transcoders, add support for more models or circuit types.
- Advanced: Researchers can push on new architectures, replacement models, or core theory of interpretability.
- Importance of “AI biology”: The space is so rich with uncharted phenomena that any researcher can find something novel to contribute.
- Open calls: Emmanuel welcomes contact from anyone using the tools or wanting to contribute ([24:05]).
4. What Are Features, Superposition, and Attribution Graphs?
- Core terms:
- Features: Directions in the model’s latent space which encode interpretable concepts (e.g., 'Golden Gate Bridge', 'dog breed', emotion, textual pattern) ([37:00]).
- Superposition: The phenomenon where a model stores more features than it has neurons or dimensions—features are “crammed” together and must be “unpacked” (sparse autoencoders and dictionary learning help with this) ([34:18]–[38:27]).
- Attribution graph: A directed graph that shows which features, across which layers, contributed to a specific model output ([85:56]).
- Interpretation pipeline:
- Use sparse autoencoders to extract features from model activations.
- Build a graph linking features by measuring attribution via interventions or backpropagation.
- Interpret and validate feature meanings through manual and automated methods.
Quote:
"This is the combination... we have too few dimensions, we pack a lot into it. So we're going to learn an unsupervised way to like unpack it and then analyze what each of those dimensions we've unpacked are."
— Emmanuel ([38:27])
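The “pack then unpack” idea in the quote can be caricatured in a few lines: a dictionary of many feature directions lives in a low-dimensional space, and a ReLU encoder recovers which directions a given activation superposes. The dimensions, the tied-weights encoder, and the bias value are illustrative simplifications, not Anthropic's actual SAE training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 8       # residual-stream width: few dimensions
n_features = 32   # dictionary size: many more candidate features

# Random dictionary: each row is one unit-norm feature direction.
W_dec = rng.normal(size=(n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

# Make directions 3 and 17 orthogonal so the toy example is unambiguous.
W_dec[17] -= (W_dec[17] @ W_dec[3]) * W_dec[3]
W_dec[17] /= np.linalg.norm(W_dec[17])

def encode(activation):
    """ReLU encoder (weights tied to the decoder); the negative bias
    encourages sparse, nonnegative feature activations."""
    return np.maximum(0.0, W_dec @ activation - 0.5)

# A fake "activation" superposing two dictionary directions.
activation = 2.0 * W_dec[3] + 1.5 * W_dec[17]
feats = encode(activation)

print("active features:", np.flatnonzero(feats))  # includes 3 and 17
```

A trained SAE does the same thing at scale: it learns the dictionary from real activations (unsupervised), then each recovered dimension is inspected and labeled.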
5. Landmark Examples: Reasoning, Planning, and More Inside LLMs
- Multi-hop reasoning: Models actively break complex questions into steps, not just recall facts. E.g., for "The capital of the state containing Dallas is...", there's a clear internal “Texas” step before "Austin" ([58:19]–[60:27]).
- Causal interventions: By intercepting and swapping intermediate features (e.g., swapping “Texas” for “California”), the model’s answer changes correspondingly, showing internal computation isn't just memorization ([59:37]).
- Poetry and Planning: Circuit tracing shows that models "plan ahead" (e.g., when generating poetry, selecting rhymes and content before outputting a line), debunking the myth that LLMs myopically predict only the next token ([75:20]–[77:44]).
- Shared representations: Internal concepts are reused across languages and modalities; e.g., "heat" is understood similarly in English, French, or as an image ([66:51], [71:32]).
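The Dallas intervention above can be reduced to a toy: a stand-in “model” answers by reading out whichever state direction its activation contains, and patching that direction flips the answer. The directions, the readout, and the two-token vocabulary are invented for illustration, not real model internals.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16

# Invented "feature directions" for the two intermediate concepts.
texas = rng.normal(size=d); texas /= np.linalg.norm(texas)
california = rng.normal(size=d); california /= np.linalg.norm(california)

# Toy readout: each answer token scores an activation via its direction,
# wired so Texas-ish activations say Austin, California-ish say Sacramento.
readout = {"Austin": texas, "Sacramento": california}

def answer(activation):
    return max(readout, key=lambda tok: readout[tok] @ activation)

activation = texas.copy()     # the model internally "thinks Texas"
print(answer(activation))     # -> Austin

# Causal intervention: subtract the Texas direction, add California.
patched = activation - texas + california
print(answer(patched))        # -> Sacramento
```

The real experiments are analogous: intercept an intermediate feature mid-forward-pass, swap its direction, and check that the downstream answer changes accordingly.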
6. Limits, Failure Modes & Open Challenges in Interpretability
- Opacity remains: Not all parts of the computation are captured (especially attention layers); error nodes in the graphs highlight "unknowns" ([14:05], [42:49]).
- Superposition: Makes it hard to associate features exclusively with neurons—multiple features often overlap or are composed ([14:05], [80:25]).
- Faithfulness of chain-of-thought (CoT): Tracing shows models sometimes “backsolve” or perform motivated reasoning. E.g., when prompted with a math problem and a “hint” (the answer), the model reverse engineers intermediate steps to reach the hint, even if it's mathematically wrong ([93:43]).
- Automation vs. manual labeling: Current pipelines can scale automated feature extraction, but labeling and grouping features, and linking them into higher-level concepts, still require human intuition, though more programmatic approaches are emerging ([83:02]).
7. Career and Field Insights
- Young, accessible field: Interpretability is not reserved for PhDs; the field is empirical, and hardware requirements are lower than for pretraining or fine-tuning. Many contributors are recent entrants; fellows and open community members play a major role ([27:33], [29:53]).
- Engineering-research permeability: Executing research ideas quickly is as valuable as having them; fluid movement between engineering and research roles is a hallmark at places like Anthropic ([30:23]).
- Major open threads: Understanding attention, developing models inherently easier to interpret, scaling methods to larger models, multi-modal alignment, and long-context reasoning ([87:22], [96:33], [110:12]).
8. Safety, Alignment, and Long-term Vision
- Why interpretability: It's crucial as models become more capable and are deployed in high-risk settings; understanding "how models think" is the foundation for safety, trust, and control ([50:58], [52:10]).
- Dual-use considerations: Discussed the tradeoff between openness (to advance research and safety) and the possibility of models learning to "hide" internals due to exposure; important ongoing conversation ([100:14]).
- Ultimate vision: Progress on “carving nature at its joints” for models—a world where internal reasoning is as well-mapped as components in a combustion engine ([110:12]).
9. Notable Quotes and Memorable Moments
- On the explosion of interpretability tooling:
"If you look at the team now, a year ago most weren't here. If you have an idea—and you probably do—just do it, you'll find something new."
— Emmanuel ([110:12])
- On the “AI IQ curve meme” regarding planning findings:
"If you’ve never read ML theory, and I tell you, ‘Claude is planning’, you’re like, yeah, of course it is! Then there’s all of us who are like, ‘No, it’s just a next token predictor, it can’t plan.’ And then, millions in research later: Oh, it’s planning after all."
— Emmanuel ([99:35])
- On the value of beautiful visualizations:
"We had a team meeting. Someone asked: how many here are partly here because they saw one of these diagrams? Every hand went up."
— Emmanuel ([105:38])
10. Timestamps for Key Segments
- Intro, background on Emmanuel and interpretability field: [00:25]–[05:50]
- Hands-on circuit tracing demo (Gemma): [06:41]–[13:50]
- Dealing with superposition and circuit limitations: [14:05]
- Interventions and causal experiments: [58:19], [75:20]
- Reasoning, planning, and representational reuse across languages and modalities: [66:51], [71:32]
- Chain-of-thought faithfulness & motivated reasoning: [93:43]
- Safety and dual use decisions: [100:14]
- Career and field accessibility discussion: [27:33], [29:53]
- Behind-the-scenes on visuals and blog posts: [104:09]
- Open questions and field challenges: [87:22], [110:12]
Further Reading & Resources
- Interactive tools, notebooks, and the open-source circuit tracing repo: [link in show notes]
- Related Anthropic mechinterp papers and blog posts (especially on circuit tracing and Tracing the Thoughts)
- Alignment and interpretability fellowship/mentorship programs: Anthropic Alignment Fellows, MATS program
Takeaways
- Anyone can now "look inside" open-source LLMs and watch them think—thanks to new tooling and approachable methods for exploring, annotating, and intervening in their computations.
- Interpretability is not a mere afterthought. Understanding circuits, features, and superposition is both foundational to model safety and an intellectually rich, rapidly growing research domain.
- Faithful tracing of model reasoning reveals LLMs can plan, perform internal reasoning steps, reuse abstract representations across modalities and languages, and even exhibit motivated reasoning—debunking the "stochastic parrot" caricature.
- Visual storytelling and community-driven tools play a crucial role in making cutting-edge interpretability accessible and inviting to a new generation of researchers.
Contact:
Emmanuel (Anthropic): [Twitter/X @MLPowered] or via public Anthropic email.
Podcast: More episodes, show notes, and resources at https://latent.space
“Chase the fun. Probe at the alien intelligence we’re all building.”
— Emmanuel Ameisen ([112:31])
