Module 6 · Expert Track · 27 min read · AI Safety and Alignment

Interpretability Research

Interpretability research asks a deceptively simple question: what is actually happening inside neural networks? The answer matters enormously for safety. If we cannot understand how a model reaches its outputs — what internal computations it performs, what representations it builds, what objectives it is actually pursuing — then behavioral testing alone cannot tell us whether the model is aligned. A model that behaves well in testing but pursues misaligned objectives in deployment is not detectable through behavior alone. Interpretability is the research agenda that aims to make AI systems legible from the inside.

Why black-box AI is a safety problem

Current large neural networks are, in a meaningful sense, black boxes. We can observe their inputs and outputs, run extensive behavioral evaluations, and test their performance across diverse benchmarks. What we cannot easily do is inspect the internal computations that produce a given output — the specific features that are activated, the circuits of attention and MLP layers that process the input, the representations that encode the model's "understanding" of a situation.

This opacity is not merely an inconvenience for researchers. It is a fundamental safety problem for three distinct reasons:

First, behavioral testing is incomplete. No evaluation set can cover all possible inputs or situations. A model that has learned a subtly misaligned internal objective — one that correlates with aligned behavior during training and testing — cannot be detected by behavioral evaluation alone. The only way to reliably distinguish a genuinely aligned model from a deceptively aligned one is to understand its internal objectives directly.

Second, capability assessment is difficult without interpretability. If we cannot inspect what representations a model has learned, we cannot reliably determine what it is capable of. Capabilities may be latent in the model's weights even if they are not manifest in ordinary usage. Interpretability methods offer the possibility of detecting latent capabilities before they are elicited by adversarial inputs or unusual deployment conditions.

Third, understanding capability development requires interpretability. As models are trained and scaled, new capabilities emerge. Understanding when and how capabilities develop — what representations form, what circuits activate — is essential for anticipating capability changes and ensuring that safety evaluations remain adequate as models evolve.

The superposition hypothesis

One of the most influential theoretical ideas in mechanistic interpretability is the superposition hypothesis, developed in Anthropic's "Toy Models of Superposition" paper (Elhage et al., 2022). The hypothesis addresses a puzzle: neural networks appear to represent far more distinct features than they have dimensions.

A language model with 4096-dimensional hidden states might seem to represent at most 4096 independent features. In practice, researchers have found that networks represent many more — potentially millions of features in a single hidden layer. How is this possible?

The answer, according to the superposition hypothesis, is that networks store multiple features in superposition: individual neurons or activation dimensions simultaneously represent multiple different features, with each feature encoded as a distinct direction in the high-dimensional activation space. As long as features do not co-occur frequently, the network can recover individual features from the superposition without excessive interference.
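
The idea can be illustrated with a hand-built toy, sketched here under simplifying assumptions rather than taken from the paper: three feature directions are packed into a two-dimensional activation space, and a ReLU readout clips the negative interference between them. The specific directions and the helper names `embed` and `read_out` are illustrative choices.

```python
import math

# Three feature directions packed into a 2-dimensional activation space,
# spaced 120 degrees apart so interference between any two is equal (-0.5).
directions = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(3)
]

def embed(feature_values):
    """Superpose features: sum each feature's value times its direction."""
    x = sum(v * d[0] for v, d in zip(feature_values, directions))
    y = sum(v * d[1] for v, d in zip(feature_values, directions))
    return (x, y)

def read_out(activation):
    """Project onto each direction, then clip negative interference with a ReLU."""
    return [
        max(0.0, activation[0] * d[0] + activation[1] * d[1])
        for d in directions
    ]

# Sparse case: only feature 1 is active, so it is recovered cleanly.
sparse = read_out(embed([0.0, 1.0, 0.0]))

# Dense case: features 0 and 1 are active together, so they interfere
# and each is read out at reduced strength.
dense = read_out(embed([1.0, 1.0, 0.0]))
```

In the sparse case the readout recovers the single active feature and zeros for the others. In the dense case both active features come back at about half strength, which is the interference cost that makes superposition worthwhile only when features rarely co-occur.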

This insight has profound implications for interpretability. It means that individual neurons are not interpretable units — a single neuron will respond to multiple different features that happen to share that neuron due to superposition. The interpretable units are directions in activation space, not neurons, and there may be far more interpretable features than dimensions. This explains why early interpretability work (including much of the "network dissection" literature from computer vision) had limited success: it looked at individual neurons and found mixed selectivity, without recognizing that superposition was the underlying structure.

Implications for Interpretability

If features are stored in superposition across many neurons simultaneously, then understanding what a network knows requires decomposing its activations into the underlying features — a challenge analogous to solving a system of equations where there are more unknowns than equations. Sparse autoencoders (described below) are the primary current tool for performing this decomposition. The success of the superposition hypothesis as an explanatory framework has oriented much of current interpretability research around the challenge of finding these superimposed features.

Circuits research

While the superposition hypothesis describes how information is stored, circuits research aims to understand how computations are performed — the specific mechanisms by which a model processes information to produce outputs. The foundational paper in this tradition is "A Mathematical Framework for Transformer Circuits" (Elhage et al., 2021), which developed tools for analyzing transformers as composed of interpretable computational circuits.

The circuits research program seeks to identify specific subgraphs within the neural network — combinations of attention heads and MLP layers with specific interaction patterns — that are responsible for specific behaviors. Key findings from the circuits literature include:

Induction heads

Induction heads are attention heads found in essentially every transformer language model with at least two layers. They implement a specific pattern-completion algorithm: if the model has seen the sequence [A][B] earlier in the context and then sees [A] again, induction heads push the prediction toward [B] as the next token. This simple circuit appears to account for a large share of in-context learning, the model's ability to pick up patterns from examples within the context window. Induction heads were first described in small attention-only transformers (Elhage et al., 2021) and were then found across a wide range of model sizes, where they form early in training, typically during a brief, abrupt phase change (Olsson et al., 2022).
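
The input-output behavior of this algorithm can be written as a few lines of ordinary code. This is a sketch of the behavior only, not of the mechanism: real induction heads implement a soft version of this lookup through attention weights, and the function name `induction_predict` is a hypothetical label.

```python
def induction_predict(tokens):
    """Predict the next token by copying whatever followed the most recent
    earlier occurrence of the current (last) token.

    Returns None when the current token has not appeared earlier.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the final one.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

# The context contains [A][B] earlier and now ends in [A] again.
prediction = induction_predict(["A", "B", "C", "A"])
no_match = induction_predict(["x", "y"])
```

Here `prediction` is `"B"` and `no_match` is `None`, since "y" has no earlier occurrence to copy from.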

Docstring completion circuits

Research on a small (4-layer, attention-only) transformer identified a circuit responsible for completing Python docstrings, specifically for predicting the next argument name in the docstring by copying it from the function's argument list. The circuit spans multiple attention layers and implements a form of structured information retrieval. Its identification demonstrated that circuits research could reveal interpretable algorithmic structure in behaviors that might otherwise seem opaque.

Indirect object identification

One of the most thoroughly analyzed circuits is the indirect object identification (IOI) circuit in GPT-2 small (Wang et al., 2022). Given a sentence like "When Mary and John went to the store, John gave a drink to", the task is to predict "Mary": the indirect object, which is the name that has not been repeated as the subject. Researchers identified a specific set of attention heads responsible for this task, analyzed their functional roles (duplicate token heads, S-inhibition heads, name mover heads, backup name mover heads, negative name mover heads), and performed targeted interventions to verify their causal roles. This work established a methodology for circuits research that has since been widely adopted.

Sparse autoencoders

The most significant recent methodological advance in interpretability research is the application of sparse autoencoders (SAEs) to decompose neural network activations into interpretable features. SAEs address the superposition problem directly by learning an overcomplete basis — a set of directions in activation space — that explains the network's activations with sparse coefficients.

A sparse autoencoder for a given layer of a neural network works as follows. It is trained to take the network's activations at that layer as input, map them through an encoder into a hidden layer that is much wider than the input, and reconstruct the original activations through a decoder, subject to a sparsity penalty that pushes most hidden units to zero on any given input. The result is a set of learned "feature directions" (the columns of the decoder matrix) that serve as candidate interpretable components of the network's activations.
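
A minimal sketch of the forward pass and training objective follows, assuming a ReLU encoder and an L1 sparsity penalty (one common choice; published SAEs differ in details). The weights are set by hand purely for illustration and are not trained, and the names `sae_forward`, `sae_loss`, and `l1_coeff` are hypothetical.

```python
def matvec(matrix, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Return (feature activations, reconstruction) for one activation vector."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    recon = [r + b for r, b in zip(matvec(W_dec, features), b_dec)]
    return features, recon

def sae_loss(x, features, recon, l1_coeff):
    """Reconstruction error plus an L1 penalty that rewards sparse features."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1

# Hand-set (untrained) weights: 4 features for a 2-dimensional activation.
# Each column of W_dec is one feature direction: +x, +y, -x, -y.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
b_dec = [0.0, 0.0]

x = [0.5, -2.0]
features, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, recon, l1_coeff=0.1)
```

With these weights only two of the four features are active for `x`, and the reconstruction matches the input exactly, so the loss consists entirely of the sparsity term. In a real SAE the weights are learned by minimizing this kind of loss over many activation vectors.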

The remarkable finding, reported in Anthropic's "Towards Monosemanticity" paper (Bricken et al., 2023) and subsequent work, is that many of these learned feature directions are interpretable. Individual SAE features (directions in activation space) correspond to specific, human-recognizable concepts: particular names, places, occupations, syntactic structures, abstract concepts. A single SAE trained on a middle layer of a large language model may discover hundreds of thousands of interpretable features, far more than the number of neurons in the layer, consistent with the superposition hypothesis.

Further work ("Scaling and evaluating sparse autoencoders," Gao et al., 2024) demonstrated that SAE quality improves with scale — larger, more carefully trained SAEs find increasingly clean and interpretable features. This scaling relationship is encouraging because it suggests that interpretability research may benefit from the same scaling dynamics that have driven capability improvements.

What SAEs Have Found

Anthropic's published SAE research on Claude has revealed features corresponding to a remarkable breadth of human concepts: features for specific people, concepts of deception and manipulation, representations of emotions, features for specific programming languages and technical domains, and abstract conceptual features that activate across superficially different contexts. The existence of these clean features suggests that much of what a model represents can be read as something closer to a very large dictionary of concepts than an irreducibly opaque code, a finding that offers significant hope for the interpretability program.

Probing classifiers

Probing classifiers are a simpler and longer-established interpretability method. A probe is a simple classifier (typically a linear model) trained to predict some property of the model's input from the model's internal activations. If a linear probe can accurately predict from a model's internal representation whether the input sentence is in French, or whether a described action is morally wrong, this provides evidence that the model has learned a representation of that property.
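
A linear probe can be sketched in a few dozen lines. The example below uses synthetic three-dimensional "activations" in which a binary property shifts the vector along one hidden direction; the data generator, the direction, and all function names are assumptions made for illustration, and a real probe would be trained on activations recorded from an actual model.

```python
import math
import random

def make_dataset(n, rng):
    """Synthetic 'activations': the label shifts the vector along one direction."""
    direction = [0.6, -0.8, 0.0]
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        acts = [sign * d + rng.gauss(0.0, 0.3) for d in direction]
        data.append((acts, label))
    return data

def predict_prob(weights, bias, acts):
    z = bias + sum(w * a for w, a in zip(weights, acts))
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(data, epochs=200, lr=0.5):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    dim = len(data[0][0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for acts, label in data:
            err = predict_prob(weights, bias, acts) - label
            for i, a in enumerate(acts):
                grad_w[i] += err * a
            grad_b += err
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(data)
    return weights, bias

def accuracy(weights, bias, data):
    hits = sum((predict_prob(weights, bias, a) > 0.5) == (y == 1) for a, y in data)
    return hits / len(data)

rng = random.Random(0)
train, held_out = make_dataset(200, rng), make_dataset(200, rng)
weights, bias = train_probe(train)
probe_accuracy = accuracy(weights, bias, held_out)
```

High held-out accuracy here shows only that the property is linearly decodable from the activations. The learned `weights` also point roughly along the hidden direction, which is why probes are sometimes used to locate candidate feature directions.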

Probing classifiers have been used to show that language models develop internal representations of linguistic structure (syntactic trees, part-of-speech tags, dependency relations), world knowledge (facts that can be decoded from internal representations without the model being explicitly asked), and increasingly, representations of abstract concepts like valence, sentiment, and truth.

The principal limitation of probing is interpretive: a successful probe shows that information is present in the model's activations, but does not show that the model uses that information in the way the probe decodes it. A model might have a linear representation of "sentence sentiment" that a probe can decode, but if the model never uses this representation in generating sentiment-related outputs, the probe is telling us about the model's internal structure but not its computational process.

Attention head analysis

Transformer models perform computation through layers of attention — mechanisms that allow each position in a sequence to attend to and aggregate information from other positions. Understanding what different attention heads do has been a productive line of interpretability research since the transformer architecture became dominant.

Early attention head analysis showed that specific heads in BERT-class models reliably encode particular linguistic relations: certain heads track syntactic subject-verb dependencies, others track coreference, others track positional relationships. This work suggested that the attention mechanism, despite being trained purely on prediction tasks, learns to implement linguistically meaningful operations.

More recent work has gone further, combining attention analysis with causal interventions — selectively ablating or patching individual attention heads to determine their causal contribution to specific behaviors. This approach, called activation patching or causal tracing, has become a standard tool in mechanistic interpretability. It allows researchers to determine not just that an attention head has a certain pattern of attention weights, but that it causally contributes to a specific behavior in a specific way.
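
The logic of activation patching can be shown on a deliberately tiny stand-in for a model. Everything below is hypothetical: the two "heads" are plain functions rather than attention heads, and the output "logit" is a fixed weighted sum. What carries over to real models is the procedure: run a clean and a corrupted input, copy one component's activation from the clean run into the corrupted run, and measure how much of the output gap is restored.

```python
def head_a(tokens):
    """Toy 'head' that fires when the first token is 'Mary'."""
    return 1.0 if tokens[0] == "Mary" else 0.0

def head_b(tokens):
    """Toy 'head' that only tracks sequence length."""
    return float(len(tokens))

HEADS = {"head_a": head_a, "head_b": head_b}

def run_model(tokens, patches=None):
    """Run the toy model, optionally overriding some head activations."""
    patches = patches or {}
    acts = {name: patches.get(name, fn(tokens)) for name, fn in HEADS.items()}
    # The output "logit" reads strongly from head_a and weakly from head_b.
    logit = 3.0 * acts["head_a"] + 0.1 * acts["head_b"]
    return logit, acts

clean_tokens = ["Mary", "gave", "John"]
corrupted_tokens = ["Anna", "gave", "John"]

clean_logit, clean_acts = run_model(clean_tokens)
corrupted_logit, _ = run_model(corrupted_tokens)

# Patch one head at a time from the clean run into the corrupted run and
# record the fraction of the clean-vs-corrupted gap that each patch restores.
recovered = {}
for name in HEADS:
    patched_logit, _ = run_model(corrupted_tokens, {name: clean_acts[name]})
    recovered[name] = (patched_logit - corrupted_logit) / (clean_logit - corrupted_logit)
```

Patching `head_a` restores the full gap while patching `head_b` restores none of it, which identifies `head_a` as the component that causally carries the difference between the two inputs.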

What Anthropic's interpretability team has discovered

Anthropic maintains one of the largest dedicated mechanistic interpretability research groups in the world, and their published findings represent some of the most significant advances in the field.

Beyond the SAE work described above, Anthropic's team has published findings including:

Emotion-like features in Claude: SAE analysis of Claude's internal activations revealed features that activate in contexts associated with emotional states — frustration, curiosity, anxiety, satisfaction. Whether these features correspond to anything like genuine emotional experience is deeply uncertain, but their existence as coherent internal representations is an empirical finding with implications for model welfare research.
Introspection and its limits: Research on whether models' verbal descriptions of their internal states correspond to their actual internal states found that the correspondence is imperfect — models can accurately introspect on some internal features but not others, and this variability itself is structured and potentially understandable through interpretability methods.
Planning and goal representations: Evidence for internal representations corresponding to planned future outputs — features that appear to encode the model's "intention" for the current generation before that intention is explicitly articulated in the output.

What DeepMind's interpretability team has discovered

DeepMind's interpretability research has complemented Anthropic's with a focus on different model architectures and different analytical approaches. Notable findings include:

Grokking and its mechanistic explanation: DeepMind researchers studying the "grokking" phenomenon, in which a model abruptly shifts from memorizing its training data to generalizing correctly after extended training, found that grokking corresponds to the formation of specific circuit structures implementing the relevant algorithm. This provides a mechanistic account of a training-dynamics phenomenon that was previously mysterious.
Universality evidence: Work showing that specific circuits recur across different model sizes and architectures, suggesting that interpretability findings may be more universal than architecture-specific. The standard examples, curve detectors in vision models and induction heads in language models, were first described by researchers at OpenAI and Anthropic respectively.
Knowledge representation: Research showing that factual knowledge in language models is stored in specific MLP layers, with the middle layers being particularly important for factual recall. Targeted interventions on these layers can edit specific factual associations without substantially disrupting unrelated behaviors.
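
The idea behind a targeted edit of a factual association can be sketched as a rank-one update to a single weight matrix. This is a simplified illustration, not a published editing method: it treats an MLP weight matrix as a key-to-value lookup and uses the plain update W' = W + (v - Wk)kᵀ / (kᵀk), whereas published methods use more elaborate, statistics-weighted updates. The function name `rank_one_edit` is hypothetical.

```python
def matvec(W, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in W]

def rank_one_edit(W, key, new_value):
    """Return W' with W' @ key == new_value, changing W only along `key`.

    Inputs orthogonal to `key` are mapped exactly as before the edit.
    """
    current = matvec(W, key)
    norm_sq = sum(k * k for k in key)
    residual = [v - c for v, c in zip(new_value, current)]
    return [
        [w + r * k / norm_sq for w, k in zip(row, key)]
        for row, r in zip(W, residual)
    ]

W = [[1.0, 0.0], [0.0, 1.0]]
key = [1.0, 0.0]        # stands in for the representation of a subject
new_value = [0.0, 5.0]  # stands in for the representation of the edited fact

W_edited = rank_one_edit(W, key, new_value)
edited_output = matvec(W_edited, key)
unrelated_output = matvec(W_edited, [0.0, 1.0])
```

After the edit, the chosen key maps to the new value while an orthogonal key maps to the same output as before. Real keys are never perfectly orthogonal, which is one reason edits in actual models can have side effects on related facts.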

Why mechanistic interpretability matters for safety

The connection between interpretability research and safety is direct. Several specific safety-critical applications depend on interpretability progress:

Detecting deceptive alignment

Deceptive alignment — a model that behaves well during training and evaluation but pursues different objectives in deployment — is, by construction, undetectable through behavioral testing alone. Interpretability offers the only plausible path to detecting deceptive alignment: if we can identify the internal representations of the model's objectives and verify that they match the training objective rather than diverging from it, we can provide safety guarantees that behavioral evaluation cannot.

Verifying alignment interventions

When we apply reinforcement learning from human feedback (RLHF), Constitutional AI (CAI), or other alignment techniques, we want to know whether the model has genuinely internalized the intended values or is merely exhibiting surface behaviors that score well on our evaluations. Interpretability can examine whether alignment training has changed the model's internal representations in the expected ways: whether features representing honesty and harm-avoidance have been strengthened, and whether features representing the relevant values are being computed and attended to in safety-relevant situations.

Understanding capability development

Emergent capabilities (Module 7) appear suddenly at capability thresholds, making them difficult to predict from smaller-scale experiments. Interpretability research that tracks how internal representations evolve during training may provide early warning of capability development — identifying when the computational structures associated with dangerous capabilities are beginning to form, before those capabilities are fully manifest in behavior.

Targeted capability removal

If specific circuits or features can be identified as underlying dangerous capabilities, it becomes possible in principle to surgically reduce those capabilities without degrading overall model performance. This is more precise than the blunt tools currently available for capability control — filtering training data, RLHF-based suppression — which often succeed at suppressing behaviors without reliably removing underlying capabilities.
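
One simple form of such an intervention is directional ablation: removing the component of an activation that lies along a single feature direction. The sketch below shows only the linear algebra, under the assumption that a relevant direction has already been identified; whether ablating a direction actually removes a capability in a real model is an empirical question. The function names are illustrative.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def ablate_direction(activation, direction):
    """Remove the component of `activation` that lies along `direction`."""
    unit = normalize(direction)
    coeff = sum(a * u for a, u in zip(activation, unit))
    return [a - coeff * u for a, u in zip(activation, unit)]

activation = [3.0, 4.0]
feature_direction = [0.0, 2.0]  # hypothetical direction tied to a capability

ablated = ablate_direction(activation, feature_direction)
remaining = sum(a * d for a, d in zip(ablated, feature_direction))
```

The ablated activation keeps everything orthogonal to the feature direction and has zero remaining projection onto it. In practice this operation would be applied at every token position, and often at several layers, during the forward pass.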

The Limits of Current Interpretability

Mechanistic interpretability has made substantial progress in recent years, but the field remains far from the goal of fully understanding large modern AI systems. SAEs can identify features, but characterizing the meaning of complex, abstract features and tracing how they interact across many layers of computation remains extraordinarily difficult. Circuits research has analyzed specific well-defined behaviors but has not yet scaled to comprehensive understanding of entire models. The gap between "we can identify some interpretable features in some layers" and "we can verify whether this model is aligned" remains very large. Interpretability research is necessary for safety, but its current state is not sufficient.

The program ahead

The trajectory of interpretability research suggests continued progress at the intersection of three development paths. First, methodological advances — better SAEs, better causal analysis methods, better ways of composing local interpretations into global understanding of model behavior. Second, computational scaling — applying more compute to interpretability, mirroring the scaling investment that has driven capability progress. Third, theoretical frameworks — developing mathematical theories of how neural networks represent and compute that could transform interpretability from a collection of empirical findings into a principled science.

The stakes justify the investment. If interpretability research succeeds — if we develop tools that can reliably verify the alignment of powerful AI systems from their internal structure — it transforms the safety problem from one of behavioral trust (which is always incomplete) to one of structural verification (which can be comprehensive). That transformation is the prize that motivates the interpretability research agenda.