Latent Space: The AI Engineer Podcast
Episode: The First Mechanistic Interpretability Frontier Lab — Myra Deng & Mark Bissell of Goodfire AI
Date: February 6, 2026
Guests:
- Myra Deng, Head of Product, Goodfire AI
- Mark Bissell, Technical Staff, Goodfire AI
Episode Overview
This episode spotlights Goodfire AI, a pioneering lab at the intersection of frontier AI research and real-world deployment, specializing in mechanistic interpretability ("mechinterp"). The discussion unpacks Goodfire’s mission to make AI models safer, more robust, and more customizable through advanced interpretability—applied at production scale in high-stakes industries like healthcare, code, and scientific discovery. The team also shares insights from their $150M Series B fundraise, offers a live demo of real-time model steering, and discusses the evolving landscape of AI interpretability research and its pragmatic challenges.
Key Topics and Discussion Points
1. Introducing Goodfire AI’s Mission and Background (00:29-05:01)
- Primary Focus: Goodfire is "an AI research lab that focuses on using interpretability to understand, learn from and design AI models." — Myra Deng (00:29)
- Vision: Interpretability will unlock the next frontier of safe and powerful AI models by moving from research-only, "toy model" settings to real-world deployment. — Mark Bissell (01:02)
- Growth & Fundraise: Announced a $150M Series B at a $1.25B valuation, with rapid staff growth from 10 to 40+ (01:37).
- Backgrounds: Mark came from Palantir (healthcare/data), Myra from Two Sigma (finance/ML), both in generalist roles with early-team responsibilities.
Notable Quote:
"We really believe interpretability will unlock the next frontier of safe and powerful AI models." — Myra Deng (00:35)
2. Interpretability: Definitions, Role, and Production Impact (05:07-14:07)
- What is Mechanistic Interpretability? Mark describes it as probing a model's internal mechanisms (probing, SAEs, transcoders, activation mapping, steering), moving from input-output black-box treatment toward internal understanding. "If you ask 50 people who work in interp what is interpretability, you'll probably get 50 different answers." — Mark Bissell (06:13)
- Goodfire’s Approach: Interpretability as part of a broader, more scientific approach to deep learning, not limited to post hoc analysis. Emphasis on applying interpretability in production scenarios.
- Use Cases: From removing political bias vectors ("CCP vector," 09:29) to explaining "double descent" and grokking phenomena in ML models.
- Deployment Context: Real-world guardrailing (e.g., for PII at Rakuten, 18:25), cross-lingual and cross-domain requirements, and efficiency/low-latency advantages over black-box methods.
Memorable Moment:
"Nobody knows what's going on. Right. Subliminal learning is just an insane concept when you think about it." — Mark Bissell (12:07)
3. Workflow: How Goodfire Approaches Interp Research Problems (14:07-18:15)
- Problem Selection: Start by identifying what isn't working in ML through customer/researcher conversations, try SOTA methods (e.g., SAEs, probes), evaluate failures, and iterate on the research agenda.
- Evolving Beyond SAEs: Noted limitations in using SAEs for certain types of behavior detection; sometimes raw activation probes outperform SAE-based ones on tasks like PII/harmful-intent detection.
Notable Quote:
"We have definitely run into cases where I think the concept space described by SAEs is not as clean and accurate as we would expect it to be for actual real world downstream performance metrics." — Myra Deng (17:28)
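The raw-activations-vs-SAE-features comparison above can be sketched end to end. This is a toy illustration on synthetic data, not Goodfire's pipeline: the "SAE encoder" here is just a random projection plus ReLU standing in for a trained encoder, and the label depends on a single linear direction in activation space.

```python
# Toy sketch (assumed setup, not Goodfire's code): probe the same synthetic
# "activations" directly vs. through a stand-in SAE feature basis.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, d_sae, n = 64, 256, 2000

# Synthetic residual-stream activations; the label is one linear direction.
concept = rng.normal(size=d_model)
X = rng.normal(size=(n, d_model))
y = (X @ concept > 0).astype(int)

# Stand-in for a trained SAE encoder: random projection + ReLU sparsification.
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
X_sae = np.maximum(X @ W_enc, 0.0)

accs = {}
for name, feats in [("raw activations", X), ("SAE features", X_sae)]:
    Xtr, Xte, ytr, yte = train_test_split(feats, y, random_state=0)
    accs[name] = LogisticRegression(max_iter=2000).fit(Xtr, ytr).score(Xte, yte)
    print(f"{name}: test accuracy = {accs[name]:.3f}")
```

On data this clean the raw probe is near-perfect; the interesting real-world question the episode raises is how often the feature basis, rather than the raw space, is the better place to probe.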
4. Deployment Case Study: Rakuten and Real-World Challenges (18:15-21:12)
- Production Use: Rakuten uses Goodfire to guardrail LLMs for PII at inference time: token-level classification, synthetic-to-real transfer, multilingual support (Japanese quirks), and stringent latency constraints.
- Efficiency: Probes are lightweight and dynamic, and add no extra latency (21:12).
- Complexity: Real-world data introduces unforeseen bugs, and multilingual issues force custom solutions.
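The efficiency claim is easy to see in a minimal sketch: a linear probe at inference time is one matrix-vector product per token on top of activations the model already computed. All names and shapes below are illustrative, not Goodfire's actual API, and the probe weights are random stand-ins for trained ones.

```python
# Hypothetical sketch of token-level PII guardrailing with a linear probe.
import numpy as np

rng = np.random.default_rng(1)
d_model, seq_len = 128, 6

# Pretend these are residual-stream activations for one sequence, already
# produced during the forward pass; the probe adds only one matvec per token.
acts = rng.normal(size=(seq_len, d_model))

# A trained probe is just a weight vector and a bias (random stand-ins here).
w = rng.normal(size=d_model)
b = -0.5

def pii_scores(activations, w, b):
    """Per-token PII probability from a logistic linear probe."""
    logits = activations @ w + b
    return 1.0 / (1.0 + np.exp(-logits))

scores = pii_scores(acts, w, b)
flagged = scores > 0.9   # tokens to mask/redact before returning a response
print(np.round(scores, 2), flagged)
```

Because the probe reads activations the model computes anyway, the added cost is negligible relative to a full forward pass, which is the latency advantage over running a second black-box classifier.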
5. Live Demo: Steering a 1-Trillion-Parameter Model in Real Time (21:27-29:33)
- Demo Setup: Steering Gen Z behavior in Kimi K2, a 1T-parameter model, via CLI (22:48-24:48). Editing behavior live expands possibilities for API-based customization.
- SAE/Feature Discovery: How features representing specific behaviors ("Gen Z slang") are discovered and labeled.
- Potential: Real-time steering as a knob for customizing models post-training; "inference time surgical edits."
Memorable Moment (Demo):
"We're gonna start seeing Kimi transition as the steering kicks in from normal Kimi to Gen Z Kimi, both in its chain of thought and actual outputs..." — Mark Bissell (23:41)
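Mechanically, steering of this kind usually means adding a scaled feature direction into a hidden activation during the forward pass. A minimal sketch of that mechanism, assuming a toy two-layer MLP and a random stand-in for a learned feature direction (this is not Goodfire's implementation or the K2 demo code):

```python
# Toy activation-steering sketch: inject a "feature" direction into a hidden
# activation at inference time via a PyTorch forward hook.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

steer_vec = torch.randn(32)   # stand-in for a discovered feature direction
alpha = 5.0                   # steering strength -- "the knob"

def steering_hook(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output.
    return output + alpha * steer_vec

x = torch.randn(1, 16)
baseline = model(x)

handle = model[0].register_forward_hook(steering_hook)
steered = model(x)
handle.remove()

print("output shift:", (steered - baseline).norm().item())
```

In a real LM the hook would target a chosen layer's residual stream and the direction would come from an SAE feature or similar; dialing `alpha` up or down is what makes the transition visible "live" in the demo.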
6. Fine-Tuning, Prompting, and Steering: Comparing Approaches (29:45-35:11)
- Relation to Prompting:
Steering and in-context learning (prompting) are formally equivalent in the limit
(Paper: Belief Dynamics Reveal the Dual Nature of In Context Learning and Activation Steering, 32:59).
"You can almost write a formula that tells you how to convert between the two of them."
— Mark Bissell (31:12)
- Parameter vs. Activation Space:
Compared steering (activations) to LoRA/adapters (parameter updates). Steering sometimes offers finer-grained, real-time control, but both have roles.
"...are you modifying the pipes, or are you modifying the water flowing through the pipes to get what you're after?"
— Mark Bissell (34:30)
7. Scaling, Research Accessibility & Community Growth (36:35-40:51)
- Accessibility: Mechinterp is approachable; training SAEs/probes has low compute requirements ("thousands of dollars"), with notebooks and code available from academic and community sources.
- Open Problems & Community: Recommended reading: Lee Sharkey et al.'s "Open Problems in Mechanistic Interpretability" paper (38:09). The community is growing, with enthusiastic new researchers; programs like MATS (ML Alignment & Theory Scholars) help onboard talent.
"Every incoming PhD student wants to study interpretability, which was not the case a few years ago." — Myra Deng (39:05)
- Industry Applications: First-ever mechanistic interpretability track at AI Engineer Europe (40:18), reflecting the transition from "toy" applications to industry relevance.
8. Breaching the Frontier: Scientific Models, Healthcare, and World Models (46:18-55:41)
- Healthcare Applications: Partnering with Mayo Clinic and others:
  - Using interp to debug and vet medical/biological models (e.g., ensuring genomics models don’t pick up on ancestry correlates instead of causal biology).
  - Applying foundation models and interpretability to find novel biomarkers for diseases (e.g., Alzheimer's).
- Generalization Across Domains: The same interpretability methods apply to models in robotics, materials science, code, and more.
- Pixel/World Models: Unique interpretability opportunities (visual concepts are more easily grokked than text concepts), faster feedback loops, and easier application to safety and anomaly detection in video data.
9. Sci-Fi, Safety, and Alignment (55:41-63:57)
- Philosophical Reflections: Referenced sci-fi author Ted Chiang, whose stories explore alien intelligence and the challenges of AI-human communication, relevant to interpretability and alignment.
"That is literally about a robot doing interpretability on its own mind." — Mark Bissell on Chiang's "Exhalation" (58:34)
- Safety and Alignment: Goodfire’s stance is pragmatic: not "militant" safety, but integrated, technical solutions for trustworthy model deployment. The community is broadly cohesive in desiring greater model understanding for safe deployment (61:46-62:28).
- Weak-to-Strong Generalization: An open problem: as models surpass human intelligence, will supervised interp strategies continue to work? Referenced OpenAI's "Weak-to-Strong Generalization" paper (64:27).
Notable Quotes & Segments by Timestamp
- "Interpretability will unlock the next frontier of safe and powerful AI models." — Myra Deng (00:35)
- "If you ask 50 people who work in Interp what is interpretability, you'll probably get 50 different answers." — Mark Bissell (06:13)
- "Nobody knows what's going on. Right. Subliminal learning is just an insane concept..." — Mark Bissell (12:07)
- "Probes are lightweight, add no extra latency." — Mark Bissell (21:12)
- "We're gonna start seeing Kimi transition as the steering kicks in from normal Kimi to Gen Z Kimi, both in its chain of thought and actual outputs..." — Mark Bissell (23:41)
- "It's the blessing and the curse of unsupervised methods..." — Mark Bissell (17:35)
- "You can almost write a formula that tells you how to convert between [prompting and steering]." — Mark Bissell (31:12)
- "Every incoming PhD student wants to study interpretability, which was not the case a few years ago." — Myra Deng (39:05)
- "We didn't really have to learn too much about [the new domains]; interp techniques scale pretty well across domains." — Myra Deng (51:19)
- "That is literally about a robot doing interpretability on its own mind." — Mark Bissell (58:34)
- "I think the concept space described by SAEs is not as clean and accurate as we would expect." — Myra Deng (17:28)
- "We're looking for design partners across many domains... reasoning models, world models, robotics..." — Myra Deng (52:03)
Key Resources, Papers, and Programs Mentioned
- Goodfire AI Careers: Actively hiring technical staff and design partners
- Papers:
- Open Problems in Mechanistic Interpretability — Lee Sharkey et al.
- Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering (32:59, not directly linked)
- OpenAI’s Weak to Strong Generalization
- Open Source & Community:
- MATS (ML Alignment & Theory Scholars)
- Mechanistic interpretability Slack/Discords
- Neuronpedia: visualizing neurons and model concepts
Calls to Action
- Product/Research Partnerships: Goodfire is seeking design and deployment partners in healthcare, reasoning models, code, world/pixel/robotics models, and other high-stakes applications. Contact via the Goodfire AI website.
- Individuals: If you have projects where foundation models are "almost good enough but need a magical knob to tune," Goodfire can help.
- Researchers and Engineers: The field is rapidly growing, accessible, and open to collaboration. Reach out if interested in mechanistic interpretability or joining the industry transition.
Closing Thoughts
The episode underscores mechanistic interpretability’s quick evolution from academic research to industrial application. Many challenges remain—like fine-grained behavior control, robust safety, and scaling interp to multimodal and world models. Yet, the tools, resources, and community to solve these challenges have never been more accessible. Goodfire AI stands at the forefront, inviting both design partners and curious engineers to help shape the safe, controllable, and explainable AI systems of tomorrow.
