🧠 Four Axes of LLM Mechanistic Interpretability

23 Apr, 2026 · Updated 24 Apr, 2026 · Junjie LIU

TL;DR

Mechanistic interpretability of transformers has matured along three axes: states (Logit / Tuned Lens, probing), causality (activation patching, ACDC, feature circuits), and contrastive states (RepE, CAA, steering vectors). Most interpretability blogs cover only these three.
A fourth lineage exists in the dynamical-systems community, e.g., block-Jacobian spectra, mean-field / Vlasov limits, phase portraits, Langevin / Fokker–Planck fits, Koopman / DMD, and Liang–Kleeman information flow. This lineage studies the residual stream unconditionally, i.e., as background structure, not under task contrasts.
The missing fourth axis is differential transition operators, i.e., how the layer-$\ell$ update map $f_\ell$ changes between two task conditions. The object (the transition) comes from the dynamics lineage; the contrastive framing comes from Axes 2 and 3. Nobody has bridged them.
There is no unified pathway-share metric, i.e., one number per layer comparable across tasks and scales that says how much work goes through attention vs MLP. Such a metric lives on the transition $f_\ell$, not on the state $h_\ell$.
Audience: someone who has read the IOI paper and Logit Lens, but not the dynamics literature.

Before

Most interpretability blog posts walk through methods in chronological order, e.g., Logit Lens first, then activation patching, then ACDC, then sparse autoencoders, then steering vectors. After reading them you know what each method is called, but you don’t really know what question each method is trying to answer. The literature is small enough that the methods recombine, and the chronological reading flattens the actual structure.

A more useful organization is by target, i.e., which aspect of the model does each generation of methods try to explain. Once organized this way, the existing eras line up onto three axes (states, causality, contrastive states), and a fourth axis (differential transition operators) is sitting in plain sight. The reason it is sitting in plain sight is that the interpretability crowd and the dynamical-systems crowd do not really read each other’s papers.

This post is for someone who has read the IOI paper [1] and skimmed Logit Lens, but has never heard of Koopman operators or Fokker–Planck. The first three axes are a fast recap. The fourth one is the dynamics literature, plus the gap that I want to highlight.

States

The oldest interpretability tradition for transformers is state readout. Take the hidden activation at intermediate layer $\ell$, project it back into vocabulary space, and read off what the model “thinks” it is going to say at that depth. nostalgebraist’s Logit Lens post on LessWrong (2020) is the canonical version: apply the unembedding matrix $W_U$ directly to the residual stream $h_\ell$, and look at the top-k tokens of $\text{softmax}(W_U h_\ell)$. The result is informative: by mid-depth the residual stream’s top tokens already concentrate on plausible continuations.
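The readout above can be sketched in a few lines. This is a toy illustration only: the 4-token vocabulary, the matrix `W_U`, and the hidden state `h_mid` are made-up values, and a real model would also apply its final layer norm before unembedding.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(W_U, h, vocab, k=2):
    """Project a hidden state through the unembedding and return the top-k tokens."""
    logits = [sum(w * x for w, x in zip(row, h)) for row in W_U]
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda tp: tp[1], reverse=True)
    return ranked[:k]

# Hypothetical toy setup: 3-dimensional residual stream, 4-token vocabulary.
vocab = ["Paris", "London", "cat", "the"]
W_U = [
    [2.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
    [0.0, 1.0, 1.0],
]
h_mid = [1.5, 0.2, -0.5]  # stand-in for a mid-depth residual state

top = logit_lens(W_U, h_mid, vocab)
```

Running the same function on $h_\ell$ for every $\ell$ gives the usual layer-by-token lens plot.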

The problem is that Logit Lens implicitly assumes the basis at every layer is the same basis the unembedding was trained on, which is only approximately true. Tuned Lens [2] fixes this with a small affine probe $A_\ell$ trained per layer, such that $\text{softmax}(W_U A_\ell h_\ell)$ matches the final-layer distribution. The lens becomes calibrated rather than naive.
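The training target can be stated in one line. As a sketch in this post's notation (the expectation over prompts $x$ and the direction of the KL term follow the usual description of the method; treat the exact form as an assumption rather than a quotation from [2]):

$$\min_{A_\ell}\; \mathbb{E}_{x}\Bigl[\mathrm{KL}\bigl(\text{softmax}(W_U h_L)\,\big\|\,\text{softmax}(W_U A_\ell h_\ell)\bigr)\Bigr].$$

Setting $A_\ell = I$ recovers the Logit Lens, so the learned probe can only match the final-layer distribution as well or better on the training data.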

Fig. 1 — Logit Lens vs Tuned Lens.

Linear probing is the same family of idea pushed further: train a small classifier on top of $h_\ell$ to recover any property you care about, e.g., sentiment, syntax, factual recall, position, or the answer to a multiple-choice question, and ask at which depth the property becomes linearly decodable.

These methods answer one question well: what does the model represent, and where in the stack? They do not answer a second question: how is that representation built? A linear probe at layer 12 tells you that subject information is there. It does not tell you which earlier layers wrote it, or how stable the encoding is.

Causality

The intervention era is the response to the gap above. The unifying move is counterfactual ablation: run the model on a clean prompt $x$, run it again on a corrupted prompt $x'$ that differs in one feature, then patch some component (e.g., a head’s output, an MLP block, a single residual position) from one run into the other and see if the prediction follows the patch. If patching head $\ell.h$ from the clean run into the corrupted run recovers the clean prediction, that head is causally responsible for the behavior.
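The clean-into-corrupted patch can be sketched on a toy model. Everything here is hypothetical: `head_a` and `head_b` are two made-up components whose outputs sum to a single logit, standing in for real attention heads.

```python
def head_a(x):
    # Hypothetical component that reads feature 0 of the input.
    return 2.0 * x[0]

def head_b(x):
    # Hypothetical component that reads feature 1 of the input.
    return 0.5 * x[1]

HEADS = {"a": head_a, "b": head_b}

def run(x, patch=None):
    """Return the toy logit; `patch` maps head name -> cached output to substitute."""
    patch = patch or {}
    outs = {name: patch.get(name, fn(x)) for name, fn in HEADS.items()}
    return sum(outs.values())

clean = [1.0, 1.0]
corrupted = [0.0, 1.0]  # differs from the clean prompt only in feature 0

# Cache every component's output on the clean run.
clean_cache = {name: fn(clean) for name, fn in HEADS.items()}
clean_logit = run(clean)
corrupt_logit = run(corrupted)

def recovery(name):
    """Fraction of the clean-vs-corrupted gap restored by patching one head."""
    patched = run(corrupted, patch={name: clean_cache[name]})
    return (patched - corrupt_logit) / (clean_logit - corrupt_logit)

effects = {name: recovery(name) for name in HEADS}
```

In this toy, patching `head_a` restores the whole gap and patching `head_b` restores none of it, which is the per-component effect size the text describes.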

Activation patching gives per-component effect sizes. Attribution patching linearizes the patch with a first-order gradient approximation, so you do not need to rerun the model once per intervention. ACDC (Automated Circuit DisCovery) [3] automates the next step: greedily prune the model’s computation graph until only the components whose removal hurts the metric remain, and report the resulting sparse subgraph as a “circuit”. Feature circuits push the same idea down to SAE features rather than whole heads or MLPs.
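The first-order approximation behind attribution patching can be written out. As a sketch, with $a$ denoting the activation being patched and $m$ the scalar metric (both symbols are notation introduced here for illustration):

$$m\bigl(x';\, a \leftarrow a(x)\bigr) - m(x') \;\approx\; \bigl(a(x) - a(x')\bigr)^{\top} \left.\frac{\partial m}{\partial a}\right|_{x'}.$$

One forward pass on each prompt plus one backward pass on $x'$ yields an estimate for every component at once, at the price of the linearization error.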

The intervention era is what lets us make claims like “this head copies the indirect object” or “this MLP stores the capital-of-country fact”. It is the most causally honest axis we have. The cost is twofold: (a) every claim is per-task, i.e., one circuit per behavior, and the search is expensive per hypothesis; and (b) the output is a graph of components, not a description of the transition connecting them. Patching tells you head 9.6 matters for IOI. It does not give you an operator that maps “subject is in residual stream” to “indirect object will be next token”.

Contrastive States

The third axis takes a more pragmatic stance. Instead of asking which components compute a behavior, it asks: does the activation difference between behaviors live in a low-dimensional subspace, and can we move along that subspace to control the behavior?

The recipe is the same across this whole literature. Collect activations on contrastive prompt pairs, e.g., “happy” vs “sad”, “truthful” vs “deceptive”, or “refuses” vs “complies”. Take the mean activation difference $\Delta h_\ell = \mathbb{E}[h_\ell \mid +] - \mathbb{E}[h_\ell \mid -]$. Use that vector as both an analysis object (a direction in activation space tagged with semantic meaning) and a manipulation handle (add or subtract a scaled copy at inference time, and watch the behavior change). Representation Engineering [4] formalizes this for high-level concepts like honesty and emotion. (The name collides with “representation learning”; despite the vocabulary overlap, they are not the same thing.) Contrastive Activation Addition (CAA) is the same idea localized to a single layer. The broader steering-vector literature treats it as a general control surface for safety-relevant behaviors.
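The mean-difference recipe is short enough to write down directly. A minimal sketch, assuming the activations have already been collected as plain lists of floats; the numbers and the scale `alpha` are made up for illustration.

```python
def mean(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def steering_vector(pos_acts, neg_acts):
    """Delta h = E[h | +] - E[h | -], estimated from samples."""
    mu_pos, mu_neg = mean(pos_acts), mean(neg_acts)
    return [p - q for p, q in zip(mu_pos, mu_neg)]

def steer(h, direction, alpha):
    """Add a scaled copy of the direction to a hidden state at inference time."""
    return [x + alpha * d for x, d in zip(h, direction)]

# Hypothetical layer-l activations for two prompts per condition.
pos = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
neg = [[0.0, 0.0, 2.0], [0.0, 2.0, 2.0]]

delta = steering_vector(pos, neg)
steered = steer([0.0, 0.0, 0.0], delta, alpha=0.5)
```

The third coordinate is identical across conditions, so it drops out of `delta`; only the coordinates that separate the two prompt sets survive.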

This axis is appealing because it links state geometry to behavior with one cheap forward pass. The limitation is built in: it operates at the state level, not at the transition level. A steering vector tells you which direction in $\mathbb{R}^{d}$ correlates with the contrast. It does not tell you how the update map differs between the two conditions. If two behaviors share the same mean shift but differ in how the residual stream evolves around that shift, e.g., different curvature, different mixing rates, or different attention routing, the steering vector cannot see it.

So Axes 1 and 3 both target state. Axis 1 asks what is decodable from the state. Axis 3 asks what direction in state-space shifts behavior. Axis 2 is the causality axis. None of the three speak to how the state moves.

Dynamics: An Unread Lineage

The residual stream is a trajectory:

$$h_{0} \to h_{1} \to \dots \to h_{L}, \quad h_{\ell+1} = h_{\ell} + f_{\ell}(h_{\ell}),$$

where $f_{\ell}$ is the update computed by the $\ell$-th block (attention followed by the MLP), and the $h_{\ell}$ term is the residual bypass. This is a discrete-time dynamical system, and dynamical-systems people have a lot to say about discrete-time dynamical systems.

Fig. 2 — Residual-stream trajectory and Fokker–Planck potential landscape.

Six threads worth knowing about:

Block-Jacobian spectra. Li and Papyan [5] linearize one residual block through its Jacobian, $J_\ell = \partial h_{\ell+1}/\partial h_\ell$, and characterize its singular value decomposition across depth (their experiments are on ResNets, so carrying the picture over to transformer blocks is an extrapolation). They call the phenomenon residual alignment, i.e., the residual updates align with a shrinking set of singular directions as depth grows. Early layers behave like contraction, mid layers expand a few subspaces, late layers re-collapse onto a low-dimensional output manifold. This gives a depth-dependent operator picture rather than a state picture.
Mean-field / Vlasov limits. Castin et al. take the infinite-token limit of self-attention and derive a continuity equation for the empirical distribution of token representations, so that the residual stream becomes a flow on a measure space. The transformer becomes a discretization of a particular partial differential equation. This is the cleanest theoretical entry point for asking “what does attention do to a distribution of tokens?”
Phase portraits. Fernando and Guitchounts visualize forward dynamics as trajectories in low-dimensional projections of the residual stream, and look for fixed points, limit cycles, and saddle structure. The picture that emerges is depth as a slow flow toward attractor manifolds, which is qualitatively different from the “the model thinks at layer 12” mental image you get from probing.
Stochastic / Fokker–Planck fits. Sarfati et al. fit a Langevin-type model $$dh = -\nabla U(h)\,d\ell + \sigma\,dW_\ell$$ to the residual-stream evolution, and recover an effective potential $U(h)$ whose minima correspond to high-confidence answers. The picture is: forward depth is integration time, i.e., each layer is one step of an SDE rather than an arbitrary computation; the residual stream flows downhill on a potential landscape $U(h)$; and the stationary distribution has the Boltzmann form $p(h) \propto e^{-U(h)/T}$ with $T = \sigma^{2}/2$, such that the model’s softmax confidence corresponds to a temperature on the basins of $U$. The dual view is the Fokker–Planck equation $$\partial_\ell\,p(h, \ell) = \nabla \cdot \bigl(p\,\nabla U\bigr) + \tfrac{\sigma^{2}}{2}\nabla^{2} p,$$ which is the distributional view that Vlasov / mean-field analyses operate in. This is the only thread in the lineage that explicitly treats depth as integration time.
Operator-theoretic methods (Koopman, DMD). Approaches in the spirit of ATO apply Dynamic Mode Decomposition to the layer-by-layer activations to extract a linear operator that best explains the (nonlinear) state evolution. The Koopman framing converts a nonlinear dynamical question into a linear spectral one, i.e., modes, frequencies, decay rates.
Information-theoretic flow (Liang–Kleeman). The Liang–Kleeman framework quantifies the causal influence of one variable on another in a dynamical system, with units of bits per time step, and importantly does not conflate correlation with causation. Applied to the residual stream, it would say not just “head 9.6 matters” but “head 9.6 transmits $X$ bits per layer to position $i$ about feature $Y$”. Almost nobody in the interpretability literature has done this at scale.
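To make the operator-theoretic thread concrete, here is a minimal sketch of the least-squares core of DMD: fit one linear operator $A$ with $h_{\ell+1} \approx A h_\ell$ from a layer-by-layer trajectory. The 2-dimensional state, the matrix `A_true`, and the helper names are all assumptions for illustration; real residual streams need an SVD-truncated version and a proper linear-algebra library.

```python
def matmul(A, B):
    """Plain-Python matrix product of two row-major matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def inv2(M):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def fit_linear_operator(states):
    """Least-squares A = Y X^T (X X^T)^{-1} with states[l+1] ~ A @ states[l]."""
    X = transpose(states[:-1])  # columns are h_0 .. h_{L-1}
    Y = transpose(states[1:])   # columns are h_1 .. h_L
    Xt = transpose(X)
    return matmul(matmul(Y, Xt), inv2(matmul(X, Xt)))

# Toy "residual stream": a contraction-plus-rotation applied layer by layer.
A_true = [[0.9, -0.2], [0.2, 0.9]]
states = [[1.0, 0.0]]
for _ in range(6):
    h = states[-1]
    states.append([sum(a * x for a, x in zip(row, h)) for row in A_true])

A_hat = fit_linear_operator(states)

# A_true has a complex eigenvalue pair, so its modulus sqrt(det) is the
# per-layer decay rate of the recovered mode.
det = A_hat[0][0] * A_hat[1][1] - A_hat[0][1] * A_hat[1][0]
decay = det ** 0.5
```

Because the toy trajectory is exactly linear, the fit recovers `A_true`; on real activations the residual of this fit is itself a measure of how nonlinear the layer-to-layer map is.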

The important caveat is that this entire lineage studies unconditional dynamics. It characterizes the flow on average, in expectation over the data distribution, or as a property of the trained network independent of any task. It is background structure, not task-specific structure. The Jacobian spectrum of layer 9 is a property of the network. The IOI circuit is a property of the network under a particular task contrast. These are different objects, and so far they live in different papers, written by different communities.

The Gap

Fig. 3 — The four eras arranged by (object of study × task conditioning); the bottom-right cell is the gap.

Put the four eras side by side:

Axis	Object of study	Granularity	Conditional on task?
1. States (Logit / Tuned Lens, probes)	$h_\ell$	per layer	partly (probes are, lenses are not)
2. Causality (patching, ACDC, feature circuits)	components	per head/MLP	yes, one circuit per task
3. Contrastive states (RepE, CAA, steering)	$\Delta h_\ell$	per layer/region	yes
4. Dynamics (Jacobians, Koopman, Vlasov, L-K)	transition $f_\ell$	per layer	no, unconditional

One cell is empty: the transition as the object of study, conditioned on task. There is no body of work that asks “how does the transition operator $f_\ell$ change between two task conditions?” Axis 4 has the right object (the transition). Axes 2 and 3 have the right framing (contrastive, conditional). Nobody has explicitly put them together.

This combination, i.e., differential transition operators, would let you say things like:

“Under the IOI task vs. its negative control, the layer-9 transition operator gains a rank-2 component that routes from position 7 to the residual at position 11, with information flow of $X$ bits per layer.”

This is a statement about how the computation differs between two conditions, not about which components fire (Axis 2), what direction the state shifts (Axis 3), or how the network flows on average (Axis 4). It is a fourth axis.

The toolbox already supports it. Block-Jacobian spectra trivially differ across conditions; you just have to compute them under a contrast pair. Koopman / DMD on residual streams admits a natural decomposition into condition-specific modes. Liang–Kleeman is defined contrastively. The ingredients are sitting on the shelf.
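As a sketch of what “compute them under a contrast pair” could look like, the block below takes a toy update map, evaluates its Jacobian by finite differences at two hypothetical operating points (one per condition), and subtracts. The map `update`, the weights `W`, and the two activations are made up; this illustrates the idea and is not a method from the cited papers.

```python
import math

def update(h, W):
    """Toy layer update f(h) = tanh(W h); stands in for attention + MLP."""
    return [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W]

def jacobian(fn, h, eps=1e-5):
    """Central finite-difference Jacobian, J[i][j] = d fn(h)[i] / d h[j]."""
    cols = []
    for j in range(len(h)):
        hp, hm = list(h), list(h)
        hp[j] += eps
        hm[j] -= eps
        fp, fm = fn(hp), fn(hm)
        cols.append([(p - m) / (2 * eps) for p, m in zip(fp, fm)])
    # cols[j][i] holds d f_i / d h_j; transpose to row-major J[i][j].
    return [list(row) for row in zip(*cols)]

W = [[1.0, 0.0], [0.0, 1.0]]

def f(h):
    return update(h, W)

h_task = [0.0, 0.0]     # hypothetical activation under the task condition
h_control = [2.0, 0.0]  # hypothetical activation under the control condition

J_task = jacobian(f, h_task)
J_control = jacobian(f, h_control)

# Differential transition operator: how the local linear map changes with condition.
delta_J = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(J_task, J_control)]
```

The weights are identical in both conditions, yet `delta_J` is a nonzero rank-1 matrix, because the nonlinearity saturates along the first coordinate in the control condition. That is the kind of difference a state-level steering vector cannot report.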

Attention vs MLP, and Scale

A natural objection: “but the IOI circuit is almost entirely attention heads, so why bother with transition operators when components already explain the behavior?”

The IOI circuit was found in GPT-2 small, i.e., about 124M parameters (117M in the original report) and 12 layers, with a task picked specifically because it has a clean attention-routing structure. The folklore that

attention = routing/copying/induction, and MLP = key–value memory

is a useful heuristic at that scale, but it starts to break down before 1B parameters. Gated-FFN and MoE architectures push MLPs into routing-like behavior. Attention picks up more pure memorization, especially in long-context retrieval. The cleanly separable “pathway-share” you can compute on GPT-2 small becomes much fuzzier as the model grows.

What we do not have today is a unified pathway-share metric, i.e., one number per layer comparable across tasks and across model scales that says “of the work done at this layer, $\alpha$ went through attention and $1-\alpha$ through the MLP”. Such a metric is naturally a transition quantity, not a state quantity. It lives on $f_\ell$, not on $h_\ell$. The missing fourth axis is missing for an operational reason, not just a taxonomical one.
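To make the target concrete, here is one deliberately naive candidate for $\alpha$: the fraction of update norm carried by the attention branch at a layer. This is an assumption introduced for illustration, not an established metric. It ignores cancellation between the two branches, depends on the choice of norm, and says nothing about task relevance, which is part of why a unified metric is still missing.

```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def pathway_share(attn_out, mlp_out):
    """Naive share of the layer update carried by attention, in [0, 1]."""
    a, m = norm(attn_out), norm(mlp_out)
    if a + m == 0.0:
        return 0.5  # arbitrary convention for an all-zero update
    return a / (a + m)

# Hypothetical branch outputs written into the residual stream at one layer.
attn_out = [3.0, 4.0]   # norm 5
mlp_out = [0.0, 15.0]   # norm 15

alpha = pathway_share(attn_out, mlp_out)
```

Note that the inputs are the two branch outputs of the update $f_\ell$, not the state $h_\ell$, which is the sense in which the quantity lives on the transition.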

Closing

If you have spent your time reading the canonical interpretability blogs (Logit Lens, IOI, ACDC, the SAE papers, the steering-vector posts), your mental model of the field is probably organized by method. A more useful organization is by what the method tries to explain: states, causality, contrastive states, or transitions. The first three are mature. The fourth (dynamics) has a deep, careful, mostly-unread literature attached to it, but that literature studies the network unconditionally, which is exactly the thing interpretability does not need.

The next interpretability tool worth building, in my opinion, is a Koopman-style (or Jacobian-spectrum, or Liang–Kleeman) analysis run contrastively across two task conditions, and reported as a per-layer, per-pathway differential operator. The math is on the shelf.

References

[1] K. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt, Interpretability in the wild: A circuit for indirect object identification in GPT-2 small, In Proc. International Conference on Learning Representations, 2023.
[2] N. Belrose, Z. Furman, L. Smith, D. Halawi, I. Ostrovsky, L. McKinney, S. Biderman, and J. Steinhardt, Eliciting latent predictions from transformers with the tuned lens, arXiv preprint arXiv:2303.08112, 2023.
[3] A. Conmy, A. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso, Towards automated circuit discovery for mechanistic interpretability, In Proc. Advances in Neural Information Processing Systems, 2023.
[4] A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, M. Mazeika, A. Dombrowski, and others, Representation engineering: A top-down approach to AI transparency, arXiv preprint arXiv:2310.01405, 2023.
[5] J. Li and V. Papyan, Residual alignment: uncovering the mechanisms of residual networks, In Proc. Advances in Neural Information Processing Systems, 2023, vol. 36, pp. 57660–57712.
