
Hosted by Jellypod · EN

When you fine-tune an AI model, what changes inside doesn't predict what changes outside. This week on Inside the Black Box, I break down why — and what it means for anyone auditing or regulating these systems.

This episode explores why smooth, coherent language can feel more credible than it is, and how processing fluency, familiarity, and authority cues shape what we believe. It also digs into why conversational AI is especially persuasive, from polished explanations to confident-sounding confabulations.

The Heuristic Loop You Can't Break from Inside

This episode explores fluency-as-validity: the way polished AI responses can make us feel like the work of judgment is already done. It also looks at why large language models are so effective at creating the sensation of clarity, and why mechanistic interpretability may be a way to push back against that enchantment.

716 features fire on both Seneca and Marcus Aurelius but stay dark for ad copy. The model learned Stoic philosophy, not just an author's style. Plus: why 'inert' features aren't all the same thing.

We trained a fresh LoRA on the letters of Seneca and ran the same analysis pipeline we used on Marcus Aurelius and advertising copy. Every structural finding replicated. The model organizes its adaptation into five clusters: one tight (features moving in lockstep) and four loose (features cooperating more independently). Seneca produced the cleanest clustering we've measured and the strongest workhorse cluster: a group of 141 features encoding philosophical argumentation, with a causal effect more than three times stronger than anything we measured in the Marcus Aurelius model. Done in collaboration with John Holman.

We replicated our Marcus Aurelius findings at a new layer, then threw the whole method at 12 commercial ad copy styles trained into a single LoRA. The patterns held, and the new domain revealed something we couldn't have seen before: the model organizes its adaptations by register family, not by individual style.

We opened the 65%: the features that resisted interpretation one at a time turned out to organize into five co-activation clusters with clear thematic identities and causal effects nearly ten times stronger than any individual feature. Second in a series with John Holman.

In this concise, single-segment episode of Inside the Black Box: Cracking AI and Deep Learning, Arshavir Blackwell explains in one continuous narrative what neural networks are, how their simple units combine into powerful systems, and how learning by backpropagation sculpts their behavior. The short monologue introduces listeners to neural nets without equations or jargon.

This episode of Inside the Black Box: Cracking AI and Deep Learning tells the story of an unexpected convergence in the history of language and AI. In 1995, Peter Bensch noticed that Zelig Harris, a mid-century structural linguist, and Jeff Elman, a pioneer of simple recurrent networks, had independently uncovered the same deep insight about language: structure lives in patterns of use. Arshavir Blackwell, PhD, guides listeners through Harris’s world of distributional linguistics and operator grammar, where you infer structure from where words can substitute for one another, and contrasts it with Elman’s tiny recurrent neural networks that learn to predict the next word. Along the way, we see how these very different traditions arrive at the same place: hidden geometric structure in how language is used. From there, the episode bridges to today’s large language models and mechanistic interpretability, asking a deceptively simple question: what counts as "structure" inside a model? We explore how patterns, clusters, and features relate to genuine internal organization, and why Harris and Elman’s convergence still shapes how we think about circuits, features, and the geometry of meaning in modern AI.