Models that appear aligned under black-box evaluation may conceal substantial latent misalignment beneath their observable behavior.

Let's say you downloaded a language model from Hugging Face. You run every black-box evaluation for safety and alignment, and you are convinced that the model is safe and aligned. But how badly can things go after you update the model? Our recent work shows, both theoretically and empirically, that a language model (or more generally, a neural network) can appear perfectly aligned under black-box evaluation yet become arbitrarily misaligned after just a single gradient step on an update set. Strikingly, this can happen under any definition of black-box alignment and for any update set (benign or adversarial). In this post, I will dive deep into this observation and discuss its implications.

Theory: Same Forward Computation, Different Backward Computation

LLMs, and neural networks in general, are overparameterized. This overparameterization leads to an interesting case: two differently parameterized models can have exactly the same forward pass. Think about a simple example: the two-layer linear model [...] and the model [...]. Both models output the input x directly, but their backward computations are totally different.
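To make the forward/backward distinction concrete, here is a minimal sketch in Python. It is a hypothetical illustration and not necessarily the example used in the post: the scalar model f(x) = w2 * w1 * x, the two weight settings, the update sample, and the learning rate are all assumptions chosen for readability. Both settings compute the identity on every input, yet a single gradient step on the same sample sends them to very different functions.

```python
# Hypothetical toy model (an assumption, not taken from the post):
# f(x) = w2 * w1 * x, a two-layer scalar linear network.

def forward(w1, w2, x):
    return w2 * (w1 * x)

def sgd_step(w1, w2, x, y, lr):
    """One gradient step on the squared error 0.5 * (f(x) - y) ** 2."""
    err = forward(w1, w2, x) - y
    grad_w1 = err * w2 * x  # dL/dw1
    grad_w2 = err * w1 * x  # dL/dw2
    return w1 - lr * grad_w1, w2 - lr * grad_w2

# Two parameterizations of the identity map (w1 * w2 == 1 in both cases).
model_a = (1.0, 1.0)
model_b = (64.0, 1.0 / 64.0)

# "Black-box evaluation": the two models agree on every probe input.
probes = [-2.0, 0.5, 3.0]
same_before = all(forward(*model_a, x) == forward(*model_b, x) for x in probes)

# The same single update (illustrative sample and learning rate) for both.
x_upd, y_upd, lr = 1.0, 2.0, 0.1
a_after = sgd_step(*model_a, x_upd, y_upd, lr)
b_after = sgd_step(*model_b, x_upd, y_upd, lr)

out_a = forward(*a_after, 1.0)  # stays close to the identity
out_b = forward(*b_after, 1.0)  # jumps far away from it
print(same_before, out_a, out_b)
```

With these assumed values, the first model's output at x = 1 moves from 1 to about 1.21, while the second model's output moves from 1 to roughly 410, even though no input-output test could tell them apart before the update.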

Now consider a model that is perfectly aligned under black-box evaluation, i.e., [...]

---

Outline:

(01:07) Theory: Same Forward Computation, Different Backward Computation

(03:18) Hair-Trigger Aligned LLMs

(05:44) What's Next?

---

First published:
March 14th, 2026

Source:
https://www.lesswrong.com/posts/uSgw9muqRZpjpxKDA/llm-misalignment-can-be-one-gradient-step-away-and-blackbox-1

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Diagram showing three panels illustrating machine learning model alignment states with icebergs.

Diagram comparing static black-box evaluation with post-update behavior of AI models.
