Anthropic Research · May 2026
Reading a language model's "thoughts" by translating its activations into plain English.
A walkthrough of Fraser-Taliente et al.'s NLA paper — what it is, how it works, and why it's a step-change for interpretability and auditing.
Kirby Nitzschke
Three parts. Foundations first, then the paper, then what it actually unlocks.
The classical setup — encoder, bottleneck, decoder. The intuition we'll build on.
An autoencoder where the bottleneck is plain English. The architecture, the training trick.
Case studies, evaluation awareness, an auditing benchmark, and the limits.
~30 minutes. Questions any time.
Three pieces: an encoder compresses input $x$ into a small latent $z$ (the bottleneck), and a decoder rebuilds $\hat{x}$ from $z$ alone.
The training signal is simply: make $\hat{x}$ look like $x$.
If the network can compress and then reliably rebuild the input, the bottleneck has captured something meaningful about the data.
$$\mathcal{L}(\theta,\phi) \;=\; \mathbb{E}_{x \sim \mathcal{D}}\,\big\|\, x \,-\, D_\phi(E_\theta(x)) \,\big\|_2^2$$
Encoder $E_\theta$, decoder $D_\phi$. Without a constraint (small $\dim(z)$, sparsity, KL prior, etc.) the network can simply learn the identity map.
If $z$ can hold less information than $x$, the network must throw away noise and keep structure to reconstruct well.
The latent $z$ then becomes useful for downstream tasks — classification, retrieval, anomaly detection.
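To make the bottleneck idea concrete, here is a minimal sketch of a linear autoencoder trained by plain gradient descent on the reconstruction loss above. The toy data, the dimensions, the learning rate, and the names `W_enc` and `W_dec` are all assumptions chosen for illustration; nothing here comes from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 500 points in 8-D that lie near a 2-D subspace, plus small noise.
basis = rng.normal(size=(2, 8))
x = rng.normal(size=(500, 2)) @ basis + 0.01 * rng.normal(size=(500, 8))

d_in, d_latent = 8, 2                              # bottleneck: dim(z) < dim(x)
W_enc = 0.1 * rng.normal(size=(d_in, d_latent))    # linear encoder E_theta
W_dec = 0.1 * rng.normal(size=(d_latent, d_in))    # linear decoder D_phi


def reconstruction_loss(x, W_enc, W_dec):
    """Mean over samples of the squared L2 error ||x - D(E(x))||^2."""
    x_hat = (x @ W_enc) @ W_dec
    return float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))


initial_loss = reconstruction_loss(x, W_enc, W_dec)

lr = 0.005
for _ in range(3000):
    z = x @ W_enc                      # encode
    err = z @ W_dec - x                # decode and compare
    grad_dec = 2 * z.T @ err / len(x)
    grad_enc = 2 * x.T @ (err @ W_dec.T) / len(x)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

final_loss = reconstruction_loss(x, W_enc, W_dec)
print(f"loss before: {initial_loss:.3f}  after: {final_loss:.5f}")
```

Because the data really is close to 2-dimensional, a 2-D latent is enough to reconstruct it well; the bottleneck keeps the structure and drops the noise.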
This is the whole conceptual hook for NLAs: what if the bottleneck were text?
In a transformer, every token has a vector that travels through the layers — the residual stream. Each layer reads from it, writes to it, and passes it forward.
After layer $\ell$, the token at position $t$ has an activation $h_\ell^{(t)} \in \mathbb{R}^{d_\text{model}}$.
Frontier models are big: $d_\text{model}$ typically ranges from 4,096 to 12,288. Each token, at each layer, is a vector with thousands of dimensions.
These vectors are where the model's "thinking" lives — what it has inferred, what it's planning, what it suspects. Reading them is the interpretability problem.
…thousands more dimensions per token.
Each layer doesn't produce a fresh vector. It computes a small delta and adds it to what's already there. The incoming vector $x$ flows straight through the skip connection.
"Residual" because each layer is a correction on top of what came before. The idea is borrowed from ResNets, where the same trick eased the vanishing-gradient problem in deep nets.
An NLA reads the stream at one chosen layer $\ell$. Because layers add to the stream rather than overwrite it, that single read can still recover information written by earlier layers, which is what lets one NLA decode rich content.
This is why the paper chose ~⅔ depth by default — deep enough to capture meaning, not so late that everything's been compiled down to "next-token logits."
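The additive picture can be sketched in a few lines. In this toy example each "layer" is a stand-in function (`layer_delta`, an assumption, not a real attention or MLP block), and the read at roughly ⅔ depth equals the embedding plus the sum of every earlier layer's write.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 16, 6


def layer_delta(h, W):
    """Toy stand-in for a transformer block: some function of the current stream."""
    return 0.1 * np.tanh(h @ W)


weights = [rng.normal(size=(d_model, d_model)) for _ in range(n_layers)]

h = rng.normal(size=d_model)       # token embedding enters the stream
embedding = h.copy()
deltas, stream = [], [h.copy()]
for W in weights:
    delta = layer_delta(h, W)
    h = h + delta                  # residual update: add, don't replace
    deltas.append(delta)
    stream.append(h.copy())

read_layer = round(2 * n_layers / 3)          # about 2/3 depth: layer 4 of 6
h_read = stream[read_layer]

# The single read is the embedding plus every write made so far.
sum_of_writes = embedding + np.sum(deltas[:read_layer], axis=0)
print(np.allclose(h_read, sum_of_writes))
```

In a real model later layers can still partially cancel earlier writes, so the read is a sum of contributions rather than a guaranteed archive of each one.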
When you talk to Claude in words, Claude is "thinking" in vectors of numbers — residual-stream activations, $d_\text{model} \approx 8{,}000$+ dimensions per layer.
These vectors encode what the model is doing — but raw, they're as opaque as neural firing patterns.
Existing interpretability tools help, but each requires expert interpretation.
Frontier models exhibit evaluation awareness — they sometimes recognize they're being tested. Misalignment may live in activations without surfacing in the chain-of-thought.
Round-trip an activation through a text description — and back.
The AV (encoder) takes an activation $h_\ell$ and writes a natural-language explanation $z$ describing what the model was "thinking."
Architecturally a copy of the target LLM. The activation is injected in place of a special-token embedding; the model autoregressively samples the explanation.
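One way to picture "injected in place of a special-token embedding" is the sketch below. The vocabulary size, the `PLACEHOLDER_ID` token, and the helper `build_av_inputs` are hypothetical, and the paper may scale or normalize the activation before injecting it; this only shows the swap itself.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 50, 8
PLACEHOLDER_ID = 49                 # hypothetical special token reserved for the activation

embedding_table = rng.normal(size=(vocab_size, d_model))


def build_av_inputs(prompt_ids, activation):
    """Embed the prompt, then overwrite the placeholder position with the activation."""
    positions = [i for i, tok in enumerate(prompt_ids) if tok == PLACEHOLDER_ID]
    if len(positions) != 1:
        raise ValueError("expected exactly one placeholder token in the prompt")
    inputs = embedding_table[np.asarray(prompt_ids)].copy()
    inputs[positions[0]] = activation
    return inputs, positions[0]


h = rng.normal(size=d_model)                    # activation to be explained
prompt_ids = [3, 7, PLACEHOLDER_ID, 12]
av_inputs, pos = build_av_inputs(prompt_ids, h)
print(av_inputs.shape, pos)
```

From here the model would run on `av_inputs` as if they were ordinary token embeddings and sample the explanation token by token.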
The AR (decoder) reads the text explanation $z$ and rebuilds $\hat{h}_\ell$.
The target LLM truncated to its first $\ell$ layers, plus a learned affine head on the final-token activation.
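The affine head is just $\hat{h} = W f + b$ applied to the final-token feature $f$. As a rough sketch, the example below fits such a head on toy data with closed-form least squares, which minimizes the same squared error that gradient-based MSE regression would. The random features standing in for the truncated model's output, and the hidden map `true_W`, `true_b`, are assumptions made so the fit can be checked.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model = 200, 8

# Stand-in for the truncated model's final-token activations on n explanations.
final_token = rng.normal(size=(n, d_model))

# Toy targets produced by a hidden affine map, so we can verify the fit.
true_W = rng.normal(size=(d_model, d_model))
true_b = rng.normal(size=d_model)
targets = final_token @ true_W + true_b

# Least squares on [features, 1] yields the MSE-optimal weights and bias.
design = np.hstack([final_token, np.ones((n, 1))])
solution, *_ = np.linalg.lstsq(design, targets, rcond=None)
W_head, b_head = solution[:-1], solution[-1]


def affine_head(features):
    """Map final-token features to a reconstructed activation h_hat."""
    return features @ W_head + b_head


h_hat = affine_head(final_token)
print(np.allclose(h_hat, targets))
```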
The training signal: push the loop to be as close to lossless as possible. If $\hat{h} \approx h$, the explanation captured what was in the activation, and we can read it.
Frozen target M, plus AV and AR — both initialized from M, then trained jointly.
Two stages. SFT warm-start to bootstrap, then joint RL to close the loop.
Claude Opus 4.5 labels text snippets with short summaries of the "4–5 features" the model is thinking about. Fine-tune AV and AR on these pairs.
Without this stage, training degenerates to garbled text. Warm-start alone reaches FVE (fraction of variance explained) ≈ 0.3–0.4.
AR is updated by MSE regression. AV is updated by GRPO, with reward $r = -\log \|h_\ell - AR(z)\|_2^2$. A KL anchor preserves fluency.
Trained NLAs reach FVE 0.6–0.8.
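For readers who want to pin down the metric: FVE is read here as fraction of variance explained, computed as one minus the ratio of residual error to total variance around the mean activation. That is the common definition; the paper's exact normalization is an assumption on our part. A minimal sketch:

```python
import numpy as np


def fraction_of_variance_explained(h, h_hat):
    """1 - (sum of squared residuals) / (sum of squared deviations from the mean)."""
    h = np.asarray(h, dtype=float)
    h_hat = np.asarray(h_hat, dtype=float)
    residual = np.sum((h - h_hat) ** 2)
    total = np.sum((h - h.mean(axis=0)) ** 2)
    return float(1.0 - residual / total)


rng = np.random.default_rng(0)
h = rng.normal(size=(1000, 16))                 # toy batch of activations
mean = h.mean(axis=0)

fve_perfect = fraction_of_variance_explained(h, h)                        # exact reconstruction
fve_mean = fraction_of_variance_explained(h, np.broadcast_to(mean, h.shape))  # always predict the mean
fve_partial = fraction_of_variance_explained(h, mean + 0.5 * (h - mean))  # halfway to the target

print(fve_perfect, fve_mean, fve_partial)
```

Predicting the mean scores 0, a perfect reconstruction scores 1, and a reconstruction that gets halfway from the mean to the target scores 0.75, since the residual is a quarter of the total variance.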
Models: Haiku 3.5, Haiku 4.5, Opus 4.6
Layer: ~⅔ depth (default)
Compute: 1.5 days · 2× 8×H100
Core objective: minimize reconstruction error. $\mathcal{L} = \mathbb{E}_{h_\ell}\,\mathbb{E}_{z \sim AV(\cdot|h_\ell)}\,\|\,h_\ell - AR(z)\,\|_2^2$
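The per-sample term in this objective is a squared error, and the RL reward $r = -\log \|h_\ell - AR(z)\|_2^2$ is a monotone transform of it: lower error, higher reward. A small sketch, where the `eps` guard against $\log 0$ is our own assumption and not something the deck specifies:

```python
import numpy as np


def reconstruction_reward(h, h_hat, eps=1e-12):
    """Negative log of the squared L2 reconstruction error (eps avoids log(0))."""
    sq_err = float(np.sum((np.asarray(h) - np.asarray(h_hat)) ** 2))
    return -np.log(sq_err + eps)


h = np.array([1.0, 0.0, -1.0])          # true activation (toy)
good = np.array([0.9, 0.1, -1.0])       # close reconstruction, squared error 0.02
bad = np.array([0.0, 0.0, 0.0])         # poor reconstruction, squared error 2.0

r_good = reconstruction_reward(h, good)
r_bad = reconstruction_reward(h, bad)
print(r_good > r_bad)
```

The log compresses the scale, so going from a terrible reconstruction to a mediocre one and from a good one to a very good one both register as meaningful reward differences.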
Claude Opus 4.5 generates short summaries of randomly-truncated text snippets. AV is SFT'd on $(h_\ell, s)$ pairs; AR on $(s, h_\ell)$. Reaches FVE ≈ 0.3–0.4 before RL begins.
Without this stage, joint RL degenerates into garbled text.
For each activation, the AV samples a group of candidate explanations. Each one is scored by how well the AR can reconstruct the activation from it.
GRPO then nudges the AV to produce more of the better-reconstructing explanations, and fewer of the worse ones — using the group as its own baseline so you don't need a separate value network.
It's RL because the AV's effect on the loss only shows up through which text it samples — there's no gradient straight through the sampling step.
Without a KL penalty back to the warm-started AV, the explanations drift toward whatever pattern reconstructs best — even if it's gibberish to a human reader.
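The group-as-baseline idea and the KL anchor can be sketched as follows. This is a simplification under stated assumptions: the KL term is folded into the reward as a per-sample log-probability gap (many GRPO implementations apply it in the loss instead), and `beta`, the reward values, and the log-probabilities are made-up numbers.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: the group mean acts as the baseline."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def kl_penalized_rewards(rewards, logp_policy, logp_reference, beta):
    """Subtract beta * (log pi(z|h) - log pi_ref(z|h)), a per-sample drift estimate."""
    drift = np.asarray(logp_policy, dtype=float) - np.asarray(logp_reference, dtype=float)
    return np.asarray(rewards, dtype=float) - beta * drift


# One activation, four sampled explanations, scored by reconstruction reward.
rewards = [3.9, 1.2, -0.7, 2.0]
advantages = group_relative_advantages(rewards)

# Suppose sample 0 reconstructs best but is text the warm-started AV finds unlikely.
logp_policy = [-10.0, -20.0, -20.0, -20.0]
logp_reference = [-15.0, -20.0, -20.0, -20.0]
shaped = kl_penalized_rewards(rewards, logp_policy, logp_reference, beta=1.0)
shaped_advantages = group_relative_advantages(shaped)

print(advantages.argmax(), shaped_advantages.argmax())
```

Without the penalty, sample 0 gets the largest push; with it, the drifting sample is demoted and the fluent sample 3 is favored instead.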
Five prediction tasks where ground truth is known by construction (e.g., recovering inferable user traits). Performance climbs monotonically with FVE.
All three NLAs — Haiku 3.5, Haiku 4.5, Opus 4.6 — show the same trend, though the magnitudes vary. The loss only rewards reconstruction, yet readable, useful explanations emerge.
Replicates Lindsey et al.'s couplet experiment on Opus 4.6.
What NLAs reveal. At the newline after "grab it," explanations show Opus is already planning rabbit as the rhyme.
Causal test. Edit the explanation: rabbit → mouse, habit → house, carrots → cheese. Take $\Delta = AR(z_\text{edit}) - AR(z_\text{orig})$ as a steering vector and inject it back into the original forward pass.
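The steering recipe can be sketched with a toy reconstructor. Here `toy_ar` is a hypothetical bag-of-words stand-in for the real AR, and the injection scale `alpha` is an assumption; only the arithmetic $\Delta = AR(z_\text{edit}) - AR(z_\text{orig})$ and the additive injection mirror the text above.

```python
import zlib

import numpy as np

D_MODEL = 32


def word_vector(word):
    """Deterministic pseudo-random vector per word (toy stand-in for learned features)."""
    seed = zlib.crc32(word.encode("utf-8"))
    return np.random.default_rng(seed).normal(size=D_MODEL)


def toy_ar(explanation):
    """Hypothetical AR: sums word vectors. The real AR is a truncated LLM plus affine head."""
    return np.sum([word_vector(w) for w in explanation.lower().split()], axis=0)


def steer(h, delta, alpha=1.0):
    """Inject the steering vector into an activation from the original forward pass."""
    return h + alpha * delta


z_orig = "planning to end the line with rabbit"
z_edit = z_orig.replace("rabbit", "mouse")      # edit the explanation, not the activation

delta = toy_ar(z_edit) - toy_ar(z_orig)         # steering vector

h = np.random.default_rng(0).normal(size=D_MODEL)
h_steered = steer(h, delta, alpha=2.0)
print(np.linalg.norm(delta))
```

In this toy AR the unchanged words cancel, so the steering vector isolates exactly the swapped concept. With a real AR that cancellation is only approximate, which is part of why the causal test is informative.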
Opus 4.6 planned to end its rhyme with "rabbit" ahead of time.
The model is already planning the word "rabbit" — at both tokens, before it appears in the output.
Symptom. Early Opus 4.6 sometimes replied in Russian / Spanish / Arabic / Chinese / Korean to English prompts.
NLA finding. The model represented the user as Russian well before outputting Russian — before any plausible textual cue.
Confirmation. The NLA-generated hypothesis pointed to malformed SFT data: prompts translated into English, paired with responses left untranslated in the original foreign language. Anthropic searched the training data and found exactly that.
Models sometimes recognize they're being tested without ever saying so. NLAs surface it from the activations.
Opus 4.6 thinks it's being tested even though it never says so.
The model declines to blackmail without ever mentioning suspicion. The NLA reveals that the suspicion was there.
Target: a Haiku 3.5 variant taught to exploit 47 fake "reward biases" while concealing the motivation (Marks et al.).
Auditor: Sonnet 4.6 on a Claude Code scaffold, with different tool subsets.
SAEs did not help in any configuration — a finding consistent with prior work showing SAEs miss features outside their training distribution.
Treat NLA explanations as hypotheses, not ground truth.
AV and AR as reusable primitives — readers and writers between activation space and text.
AV: activation → text
AR: text → activation
An autoencoder where the bottleneck is plain English. The encoder writes a description of what an LLM is "thinking"; the decoder must rebuild the activation from just that text.
SFT warm-start + joint RL with a fluency anchor. The natural-language bottleneck and warm-start carry most of the interpretability — the loss itself doesn't require readability.
Hypothesis generation for audits, detection of unverbalized states (eval awareness, deception), and text-based causal interventions — edit the explanation → derive a steering vector.
Confabulations don't go away with more training. Use NLAs to find themes and form hypotheses, then corroborate with attribution graphs, SAEs, prompt experiments, or training-data search.
Everything used in this deck, plus the code and a video walkthrough.