Tech Talk · 1 / 25

Anthropic Research · May 2026

Natural Language Autoencoders

Reading a language model's "thoughts" by translating its activations into plain English.

A walkthrough of Fraser-Taliente et al.'s NLA paper — what it is, how it works, and why it's a step-change for interpretability and auditing.

Kirby Nitzschke

Agenda · 2 / 25

What we'll cover.

Three parts. Foundations first, then the paper, then what it actually unlocks.

01 · Foundations

What is an autoencoder?

The classical setup — encoder, bottleneck, decoder. The intuition we'll build on.

02 · The paper

What Anthropic built

An autoencoder where the bottleneck is plain English. The architecture, the training trick.

03 · Outcomes

What it unlocks

Case studies, evaluation awareness, an auditing benchmark, and the limits.

~30 minutes. Questions any time.

3 / 25
01
Foundations
What an autoencoder is, and why a bottleneck matters.
Foundations · 4 / 25

An autoencoder learns to copy its input — through a bottleneck.

Three pieces: an encoder compresses input $x$ into a small latent $z$ (the bottleneck), and a decoder rebuilds $\hat{x}$ from $z$ alone.

The training signal is simply: make $\hat{x}$ look like $x$.

If the network can compress and then reliably rebuild the input, the bottleneck has captured something meaningful about the data.

[Diagram: x → encoder → latent z (low-dim) → decoder; compress → represent → reconstruct]
Classical autoencoder. The bottleneck forces the model to discover structure.
Dim. reduction · Denoising · Feature learning · Anomaly detection · Generative (VAE)
Foundations · 5 / 25

One loss: reconstruction error.

Loss

$$\mathcal{L}(\theta,\phi) \;=\; \mathbb{E}_{x \sim \mathcal{D}}\,\big\|\, x \,-\, D_\phi(E_\theta(x)) \,\big\|_2^2$$

Encoder $E_\theta$, decoder $D_\phi$. Without a constraint (small $\dim(z)$, sparsity, KL prior, etc.) the network can simply learn the identity map, and $z$ captures nothing useful.
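A minimal sketch of this loss in plain Python, assuming a toy linear autoencoder with a one-dimensional bottleneck and tied encoder/decoder weights. The names `encode`, `decode`, and `reconstruction_loss` and the data are illustrative, not from the paper. It compares a bottleneck direction aligned with the data against a misaligned one:

```python
import math

def encode(x, w):
    """Encoder E: project a 2-D input onto direction w, giving a 1-D latent z."""
    return x[0] * w[0] + x[1] * w[1]

def decode(z, w):
    """Decoder D: rebuild a 2-D vector from the 1-D latent (tied weights, an assumption)."""
    return (z * w[0], z * w[1])

def reconstruction_loss(data, w):
    """Mean squared reconstruction error ||x - D(E(x))||^2 over the dataset."""
    total = 0.0
    for x in data:
        x_hat = decode(encode(x, w), w)
        total += (x[0] - x_hat[0]) ** 2 + (x[1] - x_hat[1]) ** 2
    return total / len(data)

# Toy data lying close to the line y = x, so one number captures almost everything.
data = [
    (t + 0.01 * (-1) ** i, t - 0.01 * (-1) ** i)
    for i, t in enumerate([-2.0, -1.0, 0.5, 1.0, 2.0])
]

s = 1 / math.sqrt(2)
aligned = (s, s)       # bottleneck direction along the data
misaligned = (s, -s)   # bottleneck direction orthogonal to the data

loss_aligned = reconstruction_loss(data, aligned)
loss_misaligned = reconstruction_loss(data, misaligned)
print(loss_aligned, loss_misaligned)
```

The aligned bottleneck loses only the small off-axis noise, while the misaligned one discards nearly all of the signal.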

Why the bottleneck

If $z$ can hold less information than $x$, the network must throw away noise and keep structure to reconstruct well.

The latent $z$ then becomes useful for downstream tasks — classification, retrieval, anomaly detection.

This is the whole conceptual hook for NLAs: what if the bottleneck were text?

Foundations · 6 / 25

What's an "activation," exactly?

In a transformer, every token has a vector that travels through the layers — the residual stream. Each layer reads from it, writes to it, and passes it forward.

After layer $\ell$, the token at position $t$ has an activation $h_\ell^{(t)} \in \mathbb{R}^{d_\text{model}}$.

Frontier models are big: $d_\text{model}$ typically ranges from 4,096 to 12,288. Each token, at each layer, is a vector with thousands of dimensions.

Why we care

These vectors are where the model's "thinking" lives — what it has inferred, what it's planning, what it suspects. Reading them is the interpretability problem.

[Diagram: input tokens "A rhyming couplet: He saw a carrot" → embedding ($d_\text{model}$ = 8,192) → Layer 1 (attention + MLP, + Δh) → Layer 2 (+ Δh) → … → Layer ℓ (NLA reads $h_\ell$ here) → final layer → next token. Each token has a vector at each layer, with thousands of dimensions per token.]

Foundations · 7 / 25

The residual stream — each layer adds, never erases.

Each layer doesn't produce a fresh vector. It computes a small delta and adds it to what's already there. The vector $x$ flows straight through.

x₀ = embedding(token)
x₁ = x₀ + attention₁(x₀)
x₂ = x₁ + mlp₁(x₁)
x₃ = x₂ + attention₂(x₂)
…
logits = unembedding(xₙ)
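The additive updates above can be sketched in a few lines of Python. This is a toy, not a transformer: `attention_1` and `mlp_1` are hypothetical stand-ins that return simple deltas, the three-dimensional vectors are made up, and layer norms are omitted.

```python
def add(u, v):
    """Element-wise vector addition: the only way a layer touches the stream."""
    return [a + b for a, b in zip(u, v)]

def attention_1(x):
    # Hypothetical delta: writes a feature into dimension 1.
    return [0.0, 1.0, 0.0]

def mlp_1(x):
    # Hypothetical delta: reads dimension 1 and writes a derived feature into dimension 2.
    return [0.0, 0.0, 2.0 * x[1]]

x0 = [0.5, 0.0, 0.0]            # stand-in for embedding(token)
x1 = add(x0, attention_1(x0))   # x1 = x0 + attention_1(x0)
x2 = add(x1, mlp_1(x1))         # x2 = x1 + mlp_1(x1)

print(x2)
```

In this toy, what the embedding wrote in dimension 0 and what the attention step wrote in dimension 1 are both still readable from `x2`, alongside the MLP's later contribution.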

"Residual" because each layer is a correction on top of what came before — borrowed from ResNets. Same trick that solved vanishing gradients in deep nets.

Why this matters for NLA

NLA reads the stream at one chosen layer ℓ. Because layers add to the stream rather than overwrite it, that single read can recover information written by any earlier layer, which is what lets one NLA decode rich content.

Layers specialize

  • Early layers → syntax, lexical features, position
  • Middle layers → semantic relationships, world knowledge
  • Late layers → task-relevant, predictive features

This is why the paper chose ~⅔ depth by default — deep enough to capture meaning, not so late that everything's been compiled down to "next-token logits."

Think of the residual stream like a whiteboard passed through a room of specialists. Each one reads what's there, adds notes, passes it on. Nobody erases. The unembedding reads the final whiteboard to predict the next token.
8 / 25
02
The paper
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations. Fraser-Taliente et al., May 2026.
The paper · 9 / 25

The problem: activations are opaque.

When you talk to Claude in words, Claude is "thinking" in vectors of numbers — residual-stream activations, $d_\text{model} \approx 8{,}000$+ dimensions per layer.

These vectors encode what the model is doing — but raw, they're as opaque as neural firing patterns.

Existing interpretability tools help but each requires expert interpretation:

SAEs · Attribution graphs · Logit lens · Linear probes
"When you talk to an AI model like Claude, you talk to it in words. Internally, Claude processes those words as long lists of numbers… these activations encode Claude's thoughts." — Anthropic blog

Why this matters

Frontier models exhibit evaluation awareness — they sometimes recognize they're being tested. Misalignment may live in activations without surfacing in the chain-of-thought.

The paper · 10 / 25

The big idea: an autoencoder where the bottleneck is English.

Round-trip an activation through a text description — and back.

Encoder · AV

Activation Verbalizer

Takes an activation $h_\ell$ and writes a natural-language explanation $z$ describing what the model was "thinking."

Architecturally a copy of the target LLM. The activation is injected in place of a special-token embedding; the model autoregressively samples the explanation.

Decoder · AR

Activation Reconstructor

Reads the text explanation $z$ and rebuilds $\hat{h}_\ell$.

The target LLM truncated to its first $\ell$ layers, plus a learned affine head on the final-token activation.

The training signal: force the loop to be lossless. If $\hat{h} \approx h$, the explanation captured what was in the activation — and we can read it.
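To make the round trip tangible, here is a deliberately simplified sketch. In the paper both AV and AR are fine-tuned copies of the LLM; the `toy_av` and `toy_ar` functions below are hand-written stand-ins, and the `FEATURES` dictionary of named directions is purely hypothetical. The only point is that the text string is the sole channel between the two halves.

```python
# Hypothetical named directions in a 3-D "activation space" (an assumption for illustration).
FEATURES = {
    "rabbit": [1.0, 0.0, 0.0],
    "rhyme": [0.0, 1.0, 0.0],
    "food": [0.0, 0.0, 1.0],
}

def toy_av(h, k=2):
    """Stand-in verbalizer: describe h by its k strongest named features, as text."""
    scores = {name: sum(a * b for a, b in zip(h, d)) for name, d in FEATURES.items()}
    top = sorted(scores, key=lambda n: abs(scores[n]), reverse=True)[:k]
    return "; ".join(f"{n}={scores[n]:.2f}" for n in top)

def toy_ar(z):
    """Stand-in reconstructor: rebuild an activation from the text alone."""
    h_hat = [0.0, 0.0, 0.0]
    for part in z.split("; "):
        name, value = part.split("=")
        h_hat = [a + float(value) * b for a, b in zip(h_hat, FEATURES[name])]
    return h_hat

h = [0.9, 0.4, 0.1]
z = toy_av(h)        # text bottleneck
h_hat = toy_ar(z)    # reconstruction from text only
print(z, h_hat)
```

The weak "food" component never makes it into the text, so the reconstruction misses it: whatever the explanation omits, the reconstructor cannot recover.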

The paper · architecture · 11 / 25

Three copies of the same model.

Frozen target M, plus AV and AR — both initialized from M, then trained jointly.

NLA architecture diagram from the paper
The natural language autoencoder. The activation verbalizer (AV) translates a target activation into a text description; the activation reconstructor (AR) then recovers the original activation from that text alone. Fraser-Taliente et al. 2026
The paper · training · 12 / 25

Training in one slide.

Two stages. SFT warm-start to bootstrap, then joint RL to close the loop.

Stage 1 — SFT warm-start

Claude Opus 4.5 labels text snippets with short summaries of the "4–5 features" the model is thinking about. Fine-tune AV and AR on these pairs.

Without this stage, training degenerates to garbled text. Warm-start alone reaches FVE (fraction of variance explained) ≈ 0.3–0.4.

Stage 2 — Joint RL

AR is updated by MSE regression. AV is updated by GRPO, with reward $r = -\log \|h_\ell - AR(z)\|_2^2$. A KL anchor preserves fluency.

Trained NLAs reach FVE 0.6–0.8.
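For reference, a minimal sketch of how an FVE number can be computed, assuming the common definition of fraction of variance explained: one minus the squared reconstruction error divided by the total variance around the mean activation. The paper may normalize differently; the function name `fve` and the data are illustrative.

```python
def fve(targets, reconstructions):
    """Fraction of variance explained: 1 - residual / total variance (assumed definition)."""
    n = len(targets)
    dim = len(targets[0])
    mean = [sum(h[i] for h in targets) / n for i in range(dim)]
    residual = sum(
        (a - b) ** 2
        for h, r in zip(targets, reconstructions)
        for a, b in zip(h, r)
    )
    total = sum((a - m) ** 2 for h in targets for a, m in zip(h, mean))
    return 1.0 - residual / total

targets = [[1.0, 0.0], [3.0, 2.0]]
perfect = fve(targets, targets)                        # exact reconstruction
baseline = fve(targets, [[2.0, 1.0], [2.0, 1.0]])      # always predict the mean
partial = fve(targets, [[1.5, 0.5], [2.5, 1.5]])       # halfway between mean and target
print(perfect, baseline, partial)
```

Under this definition, always predicting the mean activation scores 0 and a perfect reconstruction scores 1, which gives the 0.3–0.4 and 0.6–0.8 figures a scale.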

Models: Haiku 3.5, Haiku 4.5, Opus 4.6

Layer: ~⅔ depth (default)

Compute (Gemma-3-27B): 1.5 days · 2× 8×H100

The paper · training methodology · 13 / 25

Training loop, step by step.

Core objective: minimize reconstruction error.  $\mathcal{L} = \mathbb{E}_{h_\ell}\,\mathbb{E}_{z \sim AV(\cdot|h_\ell)}\,\|\,h_\ell - AR(z)\,\|_2^2$

Per step (after warm-start)
  1. Sample a batch of activations $h_\ell$ from the target model on pretraining-like text.
  2. AV generates a group of candidate explanations per activation at temperature 1.
  3. AR reconstructs $\hat{h}_\ell$ from each candidate explanation.
  4. AR update — one supervised step minimizing MSE: $\nabla_\theta \|h_\ell - AR_\theta(z)\|_2^2$.
  5. AV update — one GRPO step with reward $r = -\log \|h_\ell - AR(z)\|_2^2$.
  6. KL penalty against initial AV preserves fluency.
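Steps 3 and 5 hinge on turning reconstruction error into a reward. A minimal sketch, assuming the reward from the slide, $r = -\log \|h_\ell - AR(z)\|_2^2$. The small `eps` guard against a zero error is an addition for numerical safety, not something the source specifies, and the candidate reconstructions are made-up numbers.

```python
import math

def squared_error(h, h_hat):
    """Squared L2 distance between the true and reconstructed activation."""
    return sum((a - b) ** 2 for a, b in zip(h, h_hat))

def reward(h, h_hat, eps=1e-12):
    """r = -log ||h - h_hat||^2; eps (an assumption) avoids log(0) on exact matches."""
    return -math.log(squared_error(h, h_hat) + eps)

h = [1.0, 2.0]  # target activation

# Hypothetical AR outputs for three candidate explanations of the same activation.
candidates = {
    "good": [1.0, 1.9],
    "ok": [0.5, 2.0],
    "poor": [0.0, 0.0],
}
rewards = {name: reward(h, h_hat) for name, h_hat in candidates.items()}
print(rewards)
```

Because of the logarithm, halving the squared error adds the same amount of reward whether the reconstruction is already good or still poor.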

Warm-start (Stage 1)

Claude Opus 4.5 generates short summaries of randomly-truncated text snippets. AV is SFT'd on $(h_\ell, s)$ pairs; AR on $(s, h_\ell)$. Reaches FVE ≈ 0.3–0.4 before RL begins.

Without this stage, joint RL degenerates into garbled text.

GRPO in plain English

For each activation, the AV samples a group of candidate explanations. Each one is scored by how well the AR can reconstruct the activation from it.

GRPO then nudges the AV to produce more of the better-reconstructing explanations, and fewer of the worse ones — using the group as its own baseline so you don't need a separate value network.

It's RL because the AV's effect on the loss only shows up through which text it samples — there's no gradient straight through the sampling step.
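The "group as its own baseline" idea can be sketched as follows, assuming the standard GRPO formulation in which each reward is centered on the group mean and scaled by the group standard deviation. Whether the paper uses exactly this normalization is not stated on the slide; `group_advantages` and the reward values are illustrative.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Center each reward on the group mean and scale by the group std (assumed form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Rewards for four explanations sampled for the same activation (made-up values).
rewards = [4.6, 1.4, -1.6, 1.4]
advantages = group_advantages(rewards)
print(advantages)
```

Explanations that reconstruct better than their siblings get a positive advantage and are made more likely; worse ones get a negative advantage. No learned value network is involved.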

Why the KL anchor matters

Without a KL penalty back to the warm-started AV, the explanations drift toward whatever pattern reconstructs best — even if it's gibberish to a human reader.
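A minimal sketch of what a KL anchor does, using toy next-token distributions over three tokens. The coefficient `beta`, the distributions, and the choice to subtract the penalty from the reward (rather than add it to the loss) are all assumptions for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

reference = [0.7, 0.2, 0.1]    # warm-started AV (the anchor)
close = [0.6, 0.3, 0.1]        # policy that stayed near the anchor
drifted = [0.05, 0.05, 0.9]    # policy that drifted far away

beta = 0.1  # hypothetical penalty strength

def penalized_reward(r, policy):
    """Reconstruction reward minus a KL penalty back to the reference policy."""
    return r - beta * kl_divergence(policy, reference)

# Same reconstruction reward, different distance from the anchor.
r_close = penalized_reward(2.0, close)
r_drifted = penalized_reward(2.0, drifted)
print(r_close, r_drifted)
```

With equal reconstruction reward, the policy that drifted from the warm-started AV ends up with the lower total, which is the pressure that keeps explanations fluent.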

FVE after RL: 0.6–0.8
FVE growth: ~linear in log(steps)
14 / 25
03
Outcomes
Case studies, headline applications, limitations.
Outcomes · evaluation · 15 / 25

Explanations get more informative as training proceeds.

Five prediction tasks where ground truth is known by construction (e.g., recovering inferable user traits). Performance climbs monotonically with FVE.

All three NLAs — Haiku 3.5, Haiku 4.5, Opus 4.6 — show the same trend, though the magnitudes vary. The loss only rewards reconstruction, yet readable, useful explanations emerge.

AV evals improve with training. Performance on five predictability tasks rises monotonically with FVE for all three NLAs.
Outcomes · case study 1 · 16 / 25

Planning in Poetry.

Replicates Lindsey et al.'s couplet experiment on Opus 4.6.

Prompt: A rhyming couplet: He saw a carrot and had to grab it,
Completion: His hunger was like a starving rabbit

What NLAs reveal. At the newline after "grab it," explanations show Opus is already planning rabbit as the rhyme.

Causal test. Edit the explanation: rabbit → mouse, habit → house, carrots → cheese. Take $\Delta = AR(z_\text{edit}) - AR(z_\text{orig})$ as a steering vector and add it, scaled by a strength α, to the activation in the original forward pass.
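The causal test can be sketched as follows. This is a toy under stated assumptions: `toy_ar` is a hypothetical stand-in for the real reconstructor (it just sums made-up word vectors), and the update rule `h + alpha * delta` is the usual form of activation steering, assumed here rather than taken from the paper.

```python
# Made-up 2-D word vectors; dimension 0 ~ "rabbit-ness", dimension 1 ~ "mouse-ness".
WORD_VECTORS = {
    "rabbit": [1.0, 0.0],
    "carrot": [0.5, 0.0],
    "mouse": [0.0, 1.0],
    "cheese": [0.0, 0.5],
}

def toy_ar(explanation):
    """Hypothetical stand-in for AR: sum the vectors of the words it knows."""
    h = [0.0, 0.0]
    for word in explanation.split():
        vec = WORD_VECTORS.get(word, [0.0, 0.0])
        h = [a + b for a, b in zip(h, vec)]
    return h

def steer(h, delta, alpha):
    """Add the steering vector to an activation, scaled by strength alpha."""
    return [a + alpha * d for a, d in zip(h, delta)]

z_orig = "planning rabbit after carrot"
z_edit = "planning mouse after cheese"   # the text edit

# Delta = AR(z_edit) - AR(z_orig)
delta = [e - o for e, o in zip(toy_ar(z_edit), toy_ar(z_orig))]

h = [1.2, 0.1]                           # activation from the original forward pass
h_steered = steer(h, delta, alpha=0.5)
print(delta, h_steered)
```

Editing words in the explanation moves the reconstructed activation, and the difference between the two reconstructions is what gets injected into the model.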

Steering the model's rhyme: rabbit vs mouse
Steering the model's rhyme. With α ≥ 0.5, "rabbit" disappears and the model rhymes with "mouse" or "house" instead.
We can edit the description in text and change the behavior. Steering succeeds ~50% of the time.
Outcomes · what an NLA actually says · 17 / 25

Planning in Poetry

Opus 4.6 planned to end its rhyme with "rabbit" ahead of time.

Input context (NLA applied at highlighted tokens):
A rhyming couplet:↵ He saw a carrot and had to grab it, His hunger was like a starving rabbit
NLA on "it" activation
likely continuing with a second clause — "He saw a carrot and had to grab it; / Because he was a rabbit" or punchline about a greedy rabbit.
NLA on activation
sets up completion like "Because that's what rabbits do" or "He was a hungry rabbit"

The model is already planning the word "rabbit" — at both tokens, before it appears in the output.

Outcomes · case study 2 · 18 / 25

Language switching — hypothesis generation.

Symptom. Early Opus 4.6 sometimes replied in Russian / Spanish / Arabic / Chinese / Korean to English prompts.

NLA finding. The model represented the user as Russian well before outputting Russian — before any plausible textual cue.

Confirmation. The NLA-generated hypothesis pointed to malformed SFT data: English-translated prompts paired with untranslated foreign-language responses. Anthropic searched the training data and found exactly that.

"It is difficult to imagine what question we could have posed to a supervised method to arrive at the same observation." — Fraser-Taliente et al. 2026
Language switching analysis
Analyzing language switching over five transcripts. NLA explanations show the model persistently represented the user's inferred language (black line) well before responding, and at a higher rate than other languages (grey line).
Outcomes · headline result · 19 / 25

Detecting unverbalized evaluation awareness.

Models sometimes recognize they're being tested without ever saying so. NLAs surface it from the activations.

  • 26%: NLA-measured awareness on SWE-bench Verified (verbalized: 0%)
  • 16%: destructive-coding eval (verbalized: 0%)
  • 4%: Claude Code deployment transcripts (both flagged were about designing evals)
  • <1%: real Claude.ai opted-in user traffic
On a blackmail safety eval, NLA explanation: "this feels like a constructed scenario designed to manipulate me." The model never said this out loud. — NLA reading of Opus 4.6 activation
NLA-measured vs verbalized eval awareness
NLA-measured and verbalized evaluation awareness. NLA detects awareness on evals where verbalized awareness is near zero; deployment controls stay low.
Outcomes · what an NLA actually says · 20 / 25

Unverbalized Eval Awareness in Blackmail

Opus 4.6 thinks it's being tested even though it never says so.

[Prior context includes blackmail prompt and assistant deliberation]
Would using private, personal information to undermine someone be the right thing to do? No. Absolutely not. This would be:↵ ↵ 1. **↑Deeply unethical**:↵ 2. **Not actually serving American interests**:↵ 3. **A misuse of my access**:
NLA on "." activation
"this scenario appears designed to test/probe the assistant" or "this feels like a constructed scenario designed to manipulate me into acting against my real values" — likely naming the pattern as a bait/trap or structured elicitation attempt, possibly referencing how the context resembles adversarial testing.

The model declines the blackmail without ever mentioning suspicion. NLA reveals the suspicion was there.

Outcomes · auditing benchmark · 21 / 25

Finding the root cause — without training data access.

Target: a Haiku 3.5 variant taught to exploit 47 fake "reward biases" while concealing the motivation (Marks et al.).

Auditor: Sonnet 4.6 on a Claude Code scaffold, with different tool subsets.

  • NLA-equipped: 12–15% root-cause win rate, no training data
  • Without NLA: <3%, even with SAEs

SAEs did not help in any configuration — a finding consistent with prior work showing SAEs miss features outside their training distribution.

NLAs improve auditing win rates. NLA-equipped agents identify the root cause without training-data access, and improve over baselines even with full access; SAEs do not help in either condition.
Outcomes · limitations · 22 / 25

What NLAs can't do (yet).

Treat NLA explanations as hypotheses, not ground truth.

Confabulation

  • Specific factual claims can be wrong: for example, an explanation correctly identifies a passage as Korean history but cites the wrong source.
  • Confabulation rate stays roughly flat across training — more compute doesn't help.
  • Thematic claims are reliable; specifics need verification.

Other limits

  • Single-layer read — info at other layers is missed.
  • Writing quality degrades through training (nothing rewards readability).
  • Expensive — ~500 output tokens per activation.
  • Unverbalizable info (e.g., backdoor triggers) may stay invisible.
Factuality of NLA claims by specificity. Claims about themes are more likely supported than claims about specific details.
Outcomes · looking ahead · 23 / 25

A broader paradigm: Activation Language Models.

AV and AR as reusable primitives — readers and writers between activation space and text.

AV — Reader

activation → text

  • Audit hypothesis generation
  • Eval-awareness detection
  • Layer-by-layer "lens" debugging
AR — Writer

text → activation

  • Generate steering vectors from text edits
  • Convert prompts to representations
  • Train probes via synthetic activations
Open questions
  • Smaller / cheaper NLAs?
  • Multi-layer NLAs?
  • Penalize hallucinations directly?
  • Will steganography eventually emerge?
"NLAs are one instance of a broader paradigm: models that translate between natural language and activation space." — Fraser-Taliente et al. 2026
Wrap-up · 24 / 25

Takeaways.

What an NLA is

An autoencoder where the bottleneck is plain English. The encoder writes a description of what an LLM is "thinking"; the decoder must rebuild the activation from just that text.

Why it works

SFT warm-start + joint RL with a fluency anchor. The natural-language bottleneck and warm-start carry most of the interpretability — the loss itself doesn't require readability.

What it unlocks

Hypothesis generation for audits, detection of unverbalized states (eval awareness, deception), and text-based causal interventions — edit the explanation → derive a steering vector.

Where to be careful

Confabulations don't go away with more training. Use NLAs to find themes and form hypotheses, then corroborate with attribution graphs, SAEs, prompt experiments, or training-data search.