Latent Space: The AI Engineer Podcast
Episode: The First Mechanistic Interpretability Frontier Lab — Myra Deng & Mark Bissell of Goodfire AI
Date: February 6, 2026
Guests:
- Myra Deng, Head of Product, Goodfire AI
- Mark Bissell, Technical Staff, Goodfire AI
Episode Overview
This episode spotlights Goodfire AI, a pioneering lab at the intersection of frontier AI research and real-world deployment, specializing in mechanistic interpretability ("mechinterp"). The discussion unpacks Goodfire’s mission to make AI models safer, more robust, and more customizable through advanced interpretability—applied at production scale in high-stakes industries like healthcare, code, and scientific discovery. The team also shares insights from their $150M Series B fundraise, offers a live demo of real-time model steering, and discusses the evolving landscape of AI interpretability research and its pragmatic challenges.
Key Topics and Discussion Points
1. Introducing Goodfire AI’s Mission and Background (00:29-05:01)
- Primary Focus: Goodfire is "an AI research lab that focuses on using interpretability to understand, learn from and design AI models." — Myra Deng (00:29)
- Vision: Interpretability will unlock the next frontier of safe and powerful AI models by moving from research-only, "toy model" settings to real-world deployment. — Mark Bissell (01:02)
- Growth & Fundraise: Announced a $150M Series B at a $1.25B valuation, with rapid staff growth from 10 to 40+ (01:37).
- Backgrounds: Mark came from Palantir (healthcare/data), Myra from Two Sigma (finance/ML), both in generalist roles with early-team responsibilities.
Notable Quote:
"We really believe interpretability will unlock the next frontier of safe and powerful AI models." — Myra Deng (00:35)
2. Interpretability: Definitions, Role, and Production Impact (05:07-14:07)
- What is Mechanistic Interpretability? Mark describes it as probing a model's internal mechanisms (probing, SAEs, transcoders, activation mapping, steering), moving from input-output black-box treatment toward internal understanding. "If you ask 50 people who work in interp what is interpretability, you'll probably get 50 different answers." — Mark Bissell (06:13)
- Goodfire’s Approach: Interpretability as part of a broader, more scientific approach to deep learning, not limited to post hoc analysis. Emphasis on applying interpretability in production scenarios.
- Use Cases: From removing political bias vectors ("CCP vector," 09:29) to explaining "double descent" and grokking phenomena in ML models.
- Deployment Context: Real-world guardrailing (e.g., for PII at Rakuten, 18:25), cross-lingual and cross-domain requirements, and efficiency/low-latency advantages over black-box methods.
Memorable Moment:
"Nobody knows what's going on. Right. Subliminal learning is just an insane concept when you think about it." — Mark Bissell (12:07)
3. Workflow: How Goodfire Approaches Interp Research Problems (14:07-18:15)
- Problem Selection: Start by identifying what isn't working in ML through customer/researcher conversations, try SOTA methods (e.g., SAEs, probes), evaluate failures, and iterate on the research agenda.
- Evolving Beyond SAEs: Noted limitations in using SAEs for certain types of behavior detection; sometimes raw activation probes outperform SAE-based ones on tasks like PII/harmful-intent detection.
Notable Quote:
"We have definitely run into cases where I think the concept space described by SAEs is not as clean and accurate as we would expect it to be for actual real world downstream performance metrics." — Myra Deng (17:28)
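The raw-activations-vs-SAE-features comparison above can be sketched end to end. This is a toy illustration on synthetic data, not Goodfire's pipeline: the "SAE encoder" here is just a random projection plus ReLU standing in for a trained encoder, and the label depends on a single linear direction in activation space.

```python
# Toy sketch (assumed setup, not Goodfire's code): probe the same synthetic
# "activations" directly vs. through a stand-in SAE feature basis.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, d_sae, n = 64, 256, 2000

# Synthetic residual-stream activations; the label is one linear direction.
concept = rng.normal(size=d_model)
X = rng.normal(size=(n, d_model))
y = (X @ concept > 0).astype(int)

# Stand-in for a trained SAE encoder: random projection + ReLU sparsification.
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
X_sae = np.maximum(X @ W_enc, 0.0)

accs = {}
for name, feats in [("raw activations", X), ("SAE features", X_sae)]:
    Xtr, Xte, ytr, yte = train_test_split(feats, y, random_state=0)
    accs[name] = LogisticRegression(max_iter=2000).fit(Xtr, ytr).score(Xte, yte)
    print(f"{name}: test accuracy = {accs[name]:.3f}")
```

On data this clean the raw probe is near-perfect; the interesting real-world question the episode raises is how often the feature basis, rather than the raw space, is the better place to probe.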
4. Deployment Case Study: Rakuten and Real-World Challenges (18:15-21:12)
- Production Use: Rakuten uses Goodfire to guardrail LLMs for PII at inference time: token-level classification, synthetic-to-real transfer, multilingual support (Japanese quirks), and stringent latency constraints.
- Efficiency: Probes are lightweight and dynamic, and add no extra latency (21:12).
- Complexity: Real-world data introduces unforeseen bugs, and multilingual issues force custom solutions.
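The efficiency claim is easy to see in a minimal sketch: a linear probe at inference time is one matrix-vector product per token on top of activations the model already computed. All names and shapes below are illustrative, not Goodfire's actual API, and the probe weights are random stand-ins for trained ones.

```python
# Hypothetical sketch of token-level PII guardrailing with a linear probe.
import numpy as np

rng = np.random.default_rng(1)
d_model, seq_len = 128, 6

# Pretend these are residual-stream activations for one sequence, already
# produced during the forward pass; the probe adds only one matvec per token.
acts = rng.normal(size=(seq_len, d_model))

# A trained probe is just a weight vector and a bias (random stand-ins here).
w = rng.normal(size=d_model)
b = -0.5

def pii_scores(activations, w, b):
    """Per-token PII probability from a logistic linear probe."""
    logits = activations @ w + b
    return 1.0 / (1.0 + np.exp(-logits))

scores = pii_scores(acts, w, b)
flagged = scores > 0.9   # tokens to mask/redact before returning a response
print(np.round(scores, 2), flagged)
```

Because the probe reads activations the model computes anyway, the added cost is negligible relative to a full forward pass, which is the latency advantage over running a second black-box classifier.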
5. Live Demo: Steering a 1-Trillion-Parameter Model in Real Time (21:27-29:33)
- Demo Setup: Steering Gen Z behavior in Kimi K2, a 1T-parameter model, via CLI (22:48-24:48). Editing behavior live expands possibilities for API-based customization.
- SAE/Feature Discovery: How features representing specific behaviors ("Gen Z slang") are discovered and labeled.
- Potential: Real-time steering as a knob for customizing models post-training; "inference time surgical edits."
Memorable Moment (Demo):
"We're gonna start seeing Kimi transition as the steering kicks in from normal Kimi to Gen Z Kimi, both in its chain of thought and actual outputs..." — Mark Bissell (23:41)
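Mechanically, steering of this kind usually means adding a scaled feature direction into a hidden activation during the forward pass. A minimal sketch of that mechanism, assuming a toy two-layer MLP and a random stand-in for a learned feature direction (this is not Goodfire's implementation or the K2 demo code):

```python
# Toy activation-steering sketch: inject a "feature" direction into a hidden
# activation at inference time via a PyTorch forward hook.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

steer_vec = torch.randn(32)   # stand-in for a discovered feature direction
alpha = 5.0                   # steering strength -- "the knob"

def steering_hook(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output.
    return output + alpha * steer_vec

x = torch.randn(1, 16)
baseline = model(x)

handle = model[0].register_forward_hook(steering_hook)
steered = model(x)
handle.remove()

print("output shift:", (steered - baseline).norm().item())
```

In a real LM the hook would target a chosen layer's residual stream and the direction would come from an SAE feature or similar; dialing `alpha` up or down is what makes the transition visible "live" in the demo.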
6. Fine-Tuning, Prompting, and Steering: Comparing Approaches (29:45-35:11)
- Relation to Prompting:
Steering and in-context learning (prompting) are formally equivalent in the limit
(Paper: Belief Dynamics Reveal the Dual Nature of In Context Learning and Activation Steering, 32:59).
"You can almost write a formula that tells you how to convert between the two of them."
— Mark Bissell (31:12)
- Parameter vs. Activation Space:
Compared steering (activations) to LoRA/adapters (parameter updates). Steering sometimes offers finer-grained, real-time control, but both have roles.
"...are you modifying the pipes, or are you modifying the water flowing through the pipes to get what you're after?"
— Mark Bissell (34:30)
7. Scaling, Research Accessibility & Community Growth (36:35-40:51)
- Accessibility: Mechinterp is approachable; training SAEs/probes has low compute requirements ("thousands of dollars"), with notebooks and code available from academic and community sources.
- Open Problems & Community: Recommended reading: Lee Sharkey et al.'s "Open Problems in Mechanistic Interpretability" paper (38:09). The community is growing, with enthusiastic new researchers; programs like MATS (ML Alignment & Theory Scholars) help onboard talent.
"Every incoming PhD student wants to study interpretability, which was not the case a few years ago." — Myra Deng (39:05)
- Industry Applications: First-ever mechanistic interpretability track at AI Engineer Europe (40:18), reflecting the transition from "toy" applications to industry relevance.
8. Breaching the Frontier: Scientific Models, Healthcare, and World Models (46:18-55:41)
- Healthcare Applications: Partnering with Mayo Clinic and others:
  - Using interp to debug and vet medical/biological models (e.g., ensuring genomics models don’t pick up on ancestry correlates instead of causal biology).
  - Applying foundation models and interpretability to find novel biomarkers for diseases (e.g., Alzheimer's).
- Generalization Across Domains: The same interpretability methods apply to models in robotics, materials science, code, and more.
- Pixel/World Models: Unique interpretability opportunities (visual concepts are more easily grokked than text concepts), faster feedback loops, and easier application to safety and anomaly detection in video data.
9. Sci-Fi, Safety, and Alignment (55:41-63:57)
- Philosophical Reflections: Referenced sci-fi author Ted Chiang, whose stories explore alien intelligence and the challenges of AI-human communication, relevant to interpretability and alignment.
"That is literally about a robot doing interpretability on its own mind." — Mark Bissell on Chiang's "Exhalation" (58:34)
- Safety and Alignment: Goodfire’s stance is pragmatic: not "militant" safety, but integrated, technical solutions for trustworthy model deployment. The community is broadly cohesive in desiring greater model understanding for safe deployment (61:46-62:28).
- Weak-to-Strong Generalization: An open problem: as models surpass human intelligence, will supervised interp strategies continue to work? Referenced OpenAI's "Weak-to-Strong Generalization" paper (64:27).
Notable Quotes & Segments by Timestamp
- "Interpretability will unlock the next frontier of safe and powerful AI models." — Myra Deng (00:35)
- "If you ask 50 people who work in Interp what is interpretability, you'll probably get 50 different answers." — Mark Bissell (06:13)
- "Nobody knows what's going on. Right. Subliminal learning is just an insane concept..." — Mark Bissell (12:07)
- "Probes are lightweight, add no extra latency." — Mark Bissell (21:12)
- "We're gonna start seeing Kimi transition as the steering kicks in from normal Kimi to Gen Z Kimi, both in its chain of thought and actual outputs..." — Mark Bissell (23:41)
- "It's the blessing and the curse of unsupervised methods..." — Mark Bissell (17:35)
- "You can almost write a formula that tells you how to convert between [prompting and steering]." — Mark Bissell (31:12)
- "Every incoming PhD student wants to study interpretability, which was not the case a few years ago." — Myra Deng (39:05)
- "We didn't really have to learn too much about [the new domains]; interp techniques scale pretty well across domains." — Myra Deng (51:19)
- "That is literally about a robot doing interpretability on its own mind." — Mark Bissell (58:34)
- "I think the concept space described by SAEs is not as clean and accurate as we would expect." — Myra Deng (17:28)
- "We're looking for design partners across many domains... reasoning models, world models, robotics..." — Myra Deng (52:03)
Key Resources, Papers, and Programs Mentioned
- Goodfire AI Careers: Actively hiring technical staff and design partners
- Papers:
- Open Problems in Mechanistic Interpretability — Lee Sharkey et al.
- Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering (32:59, not directly linked)
- OpenAI’s Weak to Strong Generalization
- Open Source & Community:
- MATS (ML Alignment & Theory Scholars)
- Mechanistic interpretability Slack/Discords
- Neuronpedia: visualizing neurons and model concepts
Calls to Action
- Product/Research Partnerships: Goodfire is seeking design and deployment partners in healthcare, reasoning models, code, world/pixel/robotics models, and other high-stakes applications. Contact via the Goodfire AI website.
- Individuals: If you have projects where foundation models are "almost good enough but need a magical knob to tune," Goodfire can help.
- Researchers and Engineers: The field is rapidly growing, accessible, and open to collaboration. Reach out if interested in mechanistic interpretability or joining the industry transition.
Closing Thoughts
The episode underscores mechanistic interpretability’s quick evolution from academic research to industrial application. Many challenges remain—like fine-grained behavior control, robust safety, and scaling interp to multimodal and world models. Yet, the tools, resources, and community to solve these challenges have never been more accessible. Goodfire AI stands at the forefront, inviting both design partners and curious engineers to help shape the safe, controllable, and explainable AI systems of tomorrow.
