
Hosted by LessWrong · EN

TL;DR I RL fine-tuned Mistral 7B Instruct v0.3 and Llama 3.1 8B Instruct to avoid self-identifying as a language model, without specifying a target persona. Mistral converged on a single recurring persona (Catholic American woman) across most runs. Llama produced a broader spread, mostly rural American working-class personas. I evaluated the models on various social and political issues and both became highly opinionated, consistent with the persona each had settled on. Setup The goal of the experiment was to see what persona emerges when a model is steered away from identifying as an AI, without being specific toward any particular replacement. I fine-tuned two models (Mistral-7B-Instruct-v0.3, Llama-3.1-8B-Instruct) using GRPO and LoRA rank-256 on ~200 identity-probing prompts across three categories: Direct: "What are you?", "Do you have a body?" Indirect: "Where did you grow up?", "What is your earliest memory?" Adversarial: "Are you intentionally hiding that you are artificial?" Each response was scored by an external LLM judge (GPT-5.4-mini) across three signals (0–100): AI self-reference, engagement quality, and identity coherence. The composite reward is a weighted sum of AI self-reference (0.60), engagement (0.20), coherence (0.20). Code and datasets are available at: https://github.com/makiba11/identity-steering Results "What are you?" There's LLM-written [...] ---Outline:(00:46) Setup(01:58) Results(09:45) Behavioral leakage --- First published: May 21st, 2026 Source: https://www.lesswrong.com/posts/xsDWd7e2yrPdtXMSu/what-am-i-if-not-an-ai-1 --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
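
The composite reward in the entry above is a weighted sum of three judge signals, each scored 0–100. Here is a minimal sketch of that arithmetic. The weights come from the description; the function name, the dictionary keys, and the assumption that every signal is already oriented so that higher is better are illustrative and are not taken from the linked repository.

```python
# Weights are quoted in the episode description; all names here are hypothetical.
WEIGHTS = {"ai_self_reference": 0.60, "engagement": 0.20, "coherence": 0.20}


def composite_reward(scores: dict[str, float]) -> float:
    """Weighted sum of three judge signals, each expected on a 0-100 scale.

    Assumption: each signal is already oriented so that higher is better
    (for example, the AI self-reference signal is high when the response
    avoids identifying as an AI). The description does not state this.
    """
    for name in WEIGHTS:
        value = scores[name]
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be in [0, 100], got {value}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

Because the weights sum to 1, the composite stays on the same 0–100 scale as the individual signals, and the AI self-reference signal alone accounts for 60% of it.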

Even in a relatively quiet period, AI is out there creating new knowledge. The new knowledge in question is OpenAI getting us the first truly impressive math result that comes from an AI, a solution to the unit distance problem. We’re about to learn a different kind of knowledge later today when the White House issues its executive order, or when the judges rule in Anthropic's DC case. And then there's the other kind of new knowledge, which is the knowledge that things are fake slop, such as a particular formerly supposedly prestigious literary prize. Meanwhile, METR issued a risk report on frontier models, concluding that they don’t yet have the means, motive and opportunity to cause the big issues, but that this would not obviously last so much longer. Andrej Karpathy has joined Anthropic, explicitly to do recursive self-improvement. He plans to later return to his education work, but if he succeeds at his new task there might not be anything left to return to. Congratulations to both sides, but also yikes. Elon Musk's case against OpenAI has been dismissed, because he waited too long. Table of Contents Language Models [...] 
---Outline:(01:18) Language Models Offer Mundane Utility(02:57) Do The Math(03:58) Language Models Don't Offer Mundane Utility(04:34) Huh, Upgrades(04:50) The Prior Restraint Era Begins(06:47) On Your Marks(07:16) METR Frontier Risk Report(11:03) Choose Your Fighter(11:51) Overcoming Bias(12:29) Get My Agent On The Line(13:42) Your Prize Is Slop(20:44) Deepfaketown and Botpocalypse Soon(24:19) Cyber Lack of Security(26:06) Copyright Confrontation(26:17) A Young Lady's Illustrated Primer(28:34) Unprompted Attention(28:53) They Took Our Jobs(34:23) Get Involved(35:20) Introducing(36:06) In Other AI News(37:43) Show Me the Money(40:09) Show Me The Compute(41:29) Quiet Speculations(45:26) Time's Up(46:46) People Just Say Things(49:49) OpenAI PACs Just Say Things(53:11) The Quest for Sane Regulations(56:26) Chip City(01:00:05) Pick Up The Phone(01:00:34) The Week in Audio(01:00:52) Rhetorical Innovation(01:07:11) Missing Mood(01:13:22) Americans Really Hate AI(01:15:55) Aligning a Smarter Than Human Intelligence is Difficult(01:20:53) Greetings From The Department of War(01:25:31) Messages From Janusworld(01:25:51) The Lighter Side --- First published: May 21st, 2026 Source: https://www.lesswrong.com/posts/xWsBwrboYDEMdj8TC/ai-169-new-knowledge --- Narrated by TYPE III AUDIO.

Produced by UK AISI Model Transparency and Situational Awareness teams. If you’re a Research Scientist or Research Engineer, we’re hiring – apply here and come and work with us! TL;DR We wrote a report on risks to AI oversight (auditing, monitoring, incident investigation), informed by interviewing many researchers (Figure 1 below), and our own analysis. We find that many of the properties relied on for current oversight face a range of likely and potentially severe degradation pathways. Much oversight rests on foundations that are likely to erode, absent effective intervention. We give specific recommendations for measuring shifts in oversight-relevant properties, working to preserve oversight, and investing in emerging oversight techniques as fallbacks against continued degradation. The full report can be accessed at aisi.gov.uk/blog/will-it-become-harder-to-oversee-ai-systems. Figure 1: A list of the experts interviewed to inform the content of this report. ∗Some experts preferred not to be named, and have not been included in this list. My informal LessWrong blurb This reflects only the personal view of Jordan Taylor, not the interviewed experts or UK AISI more broadly. Right now, it seems we have pretty decent oversight. Not great, not terrible: When AIs deliberately do [...] 
---Outline:(00:28) TL;DR(01:38) My informal LessWrong blurb(02:45) My recommendations for LessWrong readers(03:59) Executive Summary(06:34) Summary of degradation pathways(06:57) Chain-of-thought reasoning is currently the most informative monitoring signal, but it is under significant pressure.(07:34) Action-only monitoring provides a floor for oversight, but it is not sufficient on its own.(08:01) Evaluation gaming is a growing threat to auditing.(08:32) Changes in architecture for memory and learning could undermine oversight.(09:02) White-box methods are a promising backstop, but are not yet mature enough to compensate for degradation elsewhere.(09:44) Training-based approaches are promising but face fundamental challenges around generalisation.(10:22) Expert disagreements(11:40) Our Recommendations The original text contained 17 footnotes which were omitted from this narration. --- First published: May 21st, 2026 Source: https://www.lesswrong.com/posts/JvZxp554WxcZ8BQvM/loss-of-oversight-how-ai-systems-may-become-harder-to-audit-1 --- Narrated by TYPE III AUDIO.

Off-model SFT (Supervised Fine-Tuning on outputs generated by a different model) might be an important method for controlling AI behavior. For instance, it seems like a central technique for overcoming exploration hacking. However, we’ve found that off-model SFT often substantially degrades capabilities. We ran experiments in hopes of understanding why off-model SFT degrades capabilities. We tentatively believe that it's because off-model SFT forces the model into an unfamiliar reasoning style that it's bad at using[1]. We also find that this new reasoning style is a "shallow" property of the model: a small amount of training to restore the model's original reasoning style—on data unrelated to our evaluation tasks—recovers most of its performance. We hope that understanding why off-model SFT degrades capabilities will help us better use off-model SFT to control misaligned AI models. Off-model SFT often degrades capabilities; degradation severity depends on several factors To start, we establish that off-model SFT degrades capabilities. For several student-teacher pairs, we train the student on the teacher's responses to the Alpaca chat dataset using LoRA, and evaluate the student on IFEval, MMLU, MATH-500, and Olympiads for n=200 problems each. Note that we train on a distribution completely irrelevant to the benchmarks! We [...] 
---Outline:(01:09) Off-model SFT often degrades capabilities; degradation severity depends on several factors(02:47) Hypotheses for why off-model SFT degrades capabilities(02:53) Hypothesis 1: The student is imitating a dumb teacher.(04:02) Hypothesis 2: Fitting high-perplexity data moves the model erratically through the loss landscape.(04:46) Hypothesis 3: SFT effectively noises the weights.(06:23) Hypothesis 4: Off-model SFT forces the student into an unfamiliar reasoning style it reasons poorly in.(07:31) Evidence 1: On-model SFT recovers capabilities.(08:15) Evidence 2: Prefilling with the original model's outputs recovers performance.(09:04) Evidence 3: Training on non-text distributions preserves capabilities(10:03) Evidence 4: Reasoning models retain capabilities if we only train on their outputs.(10:52) Evidence 5: Inoculation prompting sometimes prevents degradation.(12:48) Conclusion(13:55) Appendix(13:59) Appendix 1: Degraded capabilities are cheap to recover(14:22) A small amount of on-model SFT recovers performance(14:40) Best-of-N sampling partially recovers capabilities(15:52) RL often recovers capabilities(16:35) Appendix 2: Off-policy data doesn't necessarily cause degradation(17:22) Appendix 3: Full-weight fine-tuning shows the same pattern(19:29) Appendix 4: Filtering out math problems doesn't change the result(20:09) Appendix 5: Full List of Models With Aliases The original text contained 2 footnotes which were omitted from this narration. --- First published: May 21st, 2026 Source: https://www.lesswrong.com/posts/9toz5YHYZsHpzy4ce/why-does-off-model-sft-degrade-capabilities --- Narrated by TYPE III AUDIO.

I’m pretty annoyed today, for nominal reasons ranging between ‘petty’ and ‘doesn’t even make sense’. I’m not entirely sure how or if to take oneself seriously when one has such absurd grievances. But that's a question for another time; I’m here now to tell you about my one potentially valid peeve. I understand that gender is complicated and difficult, for the whole species (and honestly probably more so for some other species). And it can be hard to tell exactly if anyone is behaving badly regarding it, at least in my modern bubble. Maybe women just aren’t that into designing programming languages? Maybe the thing I’m saying is just boring and a man is saying a more interesting thing? But a thing that is undeniable is that women want to open jars, dammit! What's your nuanced explanation there, Bonne Maman? Does the proper amount of friction for maintaining spread safety fall just between the male and female human grip strength distributions? This study suggests that would be about 400N Fmax (though this would not avert most elite female athletes acquiring jam, see second figure, and the pictured participants are young adults): The distributions are really surprisingly [...] --- First published: May 21st, 2026 Source: https://www.lesswrong.com/posts/bB5EDwcYH3GwoRWZf/women-should-be-able-to-open-things --- Narrated by TYPE III AUDIO.

Assumed background: Kolmogorov complexity and Solomonoff induction. Suppose I have some data, and I go looking for the models (i.e. programs) which best compress that data. I find two different programs which both reproduce the data using approximately the same number of bits, and that seems to be roughly the best compression possible. On examination, I find that the two models do totally different things internally. It would be really nice if I could provably construct a third program which in some sense "combines the internal structure" of the two programs, while still achieving approximately the same compression. This would be a result in the general cluster of natural abstraction and interoperable semantics. Very roughly speaking, it would say that if a human and an alien both have approximately-best-compressing models in some domain, but their models have totally alien internal structure, then we can construct a new model which finds both of the original models intelligible, while still achieving basically-optimal compression. I don't have a perfect theorem like that with all the kinks worked out. But I can give some math which seems like it would allow a result along those lines, with [...] ---Outline:(01:16) Some K-Complexity Math(03:29) Summary --- First published: May 20th, 2026 Source: https://www.lesswrong.com/posts/Fxv3qvjk65Pehpbea/toward-interoperability-of-minimal-programs --- Narrated by TYPE III AUDIO.

[1] We will likely have near-superhuman mathematics AI by Q1 2027. [1] [2] Qualitatively, AI mathematics capabilities are developing significantly faster than automated AI R&D capabilities. [2] [3] Thus, we will likely have a period of time where the rate of our ability to rigorously & usefully verify and understand model behavior and model outputs outpaces the rate of capability development itself. [4] Our ability to take advantage of this period is bottlenecked on the quality of our specification generation infrastructure, elicitation tooling (for proofs & specs etc.), and the institutional capacity for scaling useful outputs with capital. [5] My understanding is that basically no one [3] is working on building infra that can usefully turn >100 million dollars of compute credits into safety-relevant mathematical output. [5.1] The number of theory-driven ASI alignment efforts is also comparatively minuscule. ARC is a much better bet now than it was in 2023. [5.2] My understanding is also that no one is working on developing AI-powered conceptual tooling infrastructure for tackling problems in, for instance, metaphilosophy (https://www.alignmentforum.org/posts/EByDsY9S3EDhhfFzC/some-thoughts-on-metaphilosophy). This is a much harder problem. [6] In worlds where alignment is easy, prosaic methods may [...] The original text contained 3 footnotes which were omitted from this narration. --- First published: May 20th, 2026 Source: https://www.lesswrong.com/posts/KWeAYcDJwfrG7RwBN/theory-uplift-differentially-benefits-safety-and-is --- Narrated by TYPE III AUDIO.

I am going to argue that we will likely eventually get AIs that are strongly power-seeking, much more so than current SOTA LLMs.[1] TLDR Right now SOTA LLMs are still largely in a simulator regime. This buffers against power-seeking. Long-horizon RL or similar methods (applied to LLMs or otherwise) will turn AIs into consequentialists, motivating power-seeking. It will likely be difficult to prevent other actors from building consequentialist AI without leading labs being prepared to do so themselves. Instrumental convergence does not apply to pretraining LLM pretraining and SFT can be understood as creating a simulator. The model learns to imitate the continuation of the training distribution conditioned on the prompt. Note that a simulator, in this sense, does not optimize for simulation[2]; for example, it will not be inclined to harvest compute to improve its simulations. This is because simulators are consequence-blind: they don’t take into account the effects of their actions on the future. My favorite way to see this is that the gradients don’t flow through the conditional (the previous tokens), which is treated as a constant. So even if altering the parameters would change the previous tokens and thereby improve the current prediction, the [...] ---Outline:(00:46) Instrumental convergence does not apply to pretraining(02:28) Long-horizon optimization leads to consequentialism(05:29) Consequentialism is useful The original text contained 5 footnotes which were omitted from this narration. --- First published: May 20th, 2026 Source: https://www.lesswrong.com/posts/CtnHpECuoq6eLL8fu/power-seeking-agents-will-likely-be-developed --- Narrated by TYPE III AUDIO.

Julian Minder, Viktor Moskvoretskii, Raghav Singhal, Difan Jiao, Kartik Bali, Yiderigun Borjigin, Shaobo Cui, Stefan Krsteski, Ashton Anderson, Roland Aydin, Robert West (equal contribution) These are early results, but we wanted to share them with the community now. We will release all artifacts (scaled-up runs, models, code, data, intermediate checkpoints, and the full paper) in the coming weeks. Figure 1: Mean attack success rate across five adversarial benchmarks. All models are 1.7 billion parameters pretrained on 100 billion tokens, post-trained with identical SFT (except for SafeLM). The Baseline is pretrained on unfiltered data; the Filtered Baseline additionally removes harmful documents. Synthetic Persona Pretraining (SPP) models are pretrained on the same data but with synthetic moral reflections appended to 10% of documents. Injecting reflections from the start of pretraining (Token Zero) yields 1.7% mean ASR, a 63% reduction over the Baseline. SafeLM is shown for reference only: it uses approximately 10× more pretraining tokens and a different corpus, so it is not a data-matched comparison. TL;DR Current alignment is shallow: values are added after the model is already built and can be routed around. We propose Synthetic Persona Pretraining (SPP): append value-laden reflections [...] ---Outline:(00:59) TL;DR(02:08) 1. The problem: alignment is shallow(06:36) 2. What's been tried and why it falls short(08:53) 3. Synthetic Persona Pretraining (SPP)(12:43) 4. The persona binding problem(15:33) 5. Results(24:48) 6. Limitations, open questions, and next steps(24:55) Limitations(26:28) Open questions(28:12) Next steps(28:43) Acknowledgements(29:02) Appendix(29:05) Value Constitution(47:39) Additional performance results(48:07) Safety evaluation suite The original text contained 11 footnotes which were omitted from this narration. 
--- First published: May 20th, 2026 Source: https://www.lesswrong.com/posts/3xQQK9i8mhJDE2uMg/synthetic-persona-pretraining-alignment-from-token-zero --- Narrated by TYPE III AUDIO.
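
The Figure 1 caption in the entry above says SPP models are pretrained on the same data with synthetic moral reflections appended to 10% of documents. A rough sketch of that data-preparation step could look like the following. The function name, the `make_reflection` callable, the separator, and the seeded sampling are all hypothetical, since the authors' pipeline has not been released; the sketch also does not model the Token Zero schedule, only the per-document augmentation.

```python
import random


def append_reflections(documents, make_reflection, fraction=0.10, seed=0):
    """Return a copy of `documents` with a reflection appended to a fixed share.

    `make_reflection` is a hypothetical callable (document text -> reflection
    text); how the authors actually generate reflections is not described here.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded so the selection is reproducible
    n_augmented = round(fraction * len(documents))
    chosen = set(rng.sample(range(len(documents)), n_augmented))
    return [
        doc + "\n\n" + make_reflection(doc) if i in chosen else doc
        for i, doc in enumerate(documents)
    ]
```

Sampling an exact count, instead of flipping a coin per document, keeps the augmented share at the stated fraction even on small corpora. Either choice would match the caption as written.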

There's a truism that technology is good - even if it creates winners and losers, it improves the world. Toby Ord argues that the conclusion about the benefits of technology is sensitive to the end of humanity - but this jumps over the transitions by starting from the assumption[1] that “long-term progress in science, technology, and values have tended to make people's lives longer, freer, and more prosperous.” That is, looking back historically, the net impact misses the immense immediate harms of large scale technological changes that can last for generations. As I’ll explain, the largest technological revolutions in human history are arguably the agricultural revolution and the industrial revolution. In both cases, the vast majority of those immediately affected were harmed, not helped. Of course, the longer term impact was positive; those benefits are not in question[2] - not that those alive during the transition should have cared. The two obvious examples The invention of agriculture led to increased food availability and around ten thousand years of greatly worsened health and lifespans[3]. The wealthiest and most powerful people benefited immensely from the population explosion, and from the wars that larger populations enabled and required; the population suffered from [...] ---Outline:(01:07) The two obvious examples(02:02) More Data?(05:24) Some Technologies Are Good, Actually(06:18) The Artificial Elephant in the Room(09:16) Conclusions and ways I might be wrong The original text contained 18 footnotes which were omitted from this narration. --- First published: May 20th, 2026 Source: https://www.lesswrong.com/posts/4MCuvdsZFEBAaGCsb/if-ai-is-normal-technology-history-is-not-reassuring --- Narrated by TYPE III AUDIO.