Module 7 · Expert Track · 22 min read · AI Safety and Alignment

Emergent Capabilities

Emergence — the appearance of qualitatively new capabilities at scale — is one of the most consequential and contested empirical phenomena in modern AI. When capabilities appear suddenly rather than gradually, they defeat the test-before-you-deploy strategy that governs most of safety engineering. You cannot evaluate a system for a capability it does not yet possess. Understanding emergence — what it is, when it is real, when it is a measurement artifact, and what it implies for safety — is essential for anyone working in AI development or governance.

What emergence means in the context of AI

The word "emergence" has a specific technical meaning in the AI literature. A capability is emergent if it appears abruptly when a model exceeds some scale threshold — in terms of parameters, training compute, or training data — rather than improving gradually and smoothly with scale. The key signature of emergence is not just that performance improves, but that it undergoes a phase transition: near-zero performance below a threshold, then rapid improvement above it, without a smooth interpolation in between.

This definition matters because gradual improvement and emergent improvement have very different implications for safety planning. Gradual improvement can be anticipated: if accuracy on a task has risen by about 5 percentage points with each doubling of parameters, you can reasonably predict a similar gain at the next doubling. Emergence defeats this prediction: a capability may be completely absent at one scale level and fully present at the next, with no intermediate state that would have warned you of the incoming transition.
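The contrast can be made concrete with a minimal sketch. The accuracy figures below are invented for illustration and are not taken from any benchmark, and `extrapolate_next` is a hypothetical helper that fits a straight line to accuracy against the number of parameter doublings.

```python
def extrapolate_next(scores):
    """Fit a least-squares line through (doubling index, score) and
    return the predicted score at the next doubling."""
    n = len(scores)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return intercept + slope * n

# Made-up accuracies at successive doublings of parameter count.
gradual_history = [0.30, 0.35, 0.40, 0.45]   # steady gain per doubling
gradual_actual_next = 0.50

emergent_history = [0.02, 0.02, 0.03, 0.02]  # flat, near chance
emergent_actual_next = 0.60                  # abrupt jump at the next scale

gradual_error = abs(extrapolate_next(gradual_history) - gradual_actual_next)
emergent_error = abs(extrapolate_next(emergent_history) - emergent_actual_next)

print(f"gradual forecast error:  {gradual_error:.3f}")
print(f"emergent forecast error: {emergent_error:.3f}")
```

Under these assumed numbers the trend line forecasts the gradual task almost exactly and misses the emergent one by more than half the accuracy scale, because nothing in the flat history carries information about the jump.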

The paper that established emergence as a formal phenomenon in AI was "Emergent Abilities of Large Language Models" (Wei et al., 2022), written by researchers at Google Research and collaborating institutions. It documented a large number of specific tasks, many drawn from the BIG-Bench benchmark, on which model performance was flat (near random) below specific scale thresholds and then improved sharply. Plotted as a function of training compute or parameter count, performance showed clear phase-transition-like patterns.

Documented examples of emergent capabilities

The empirical literature contains many well-documented examples of emergent capabilities in large language models. These examples are worth examining in detail because they illustrate both the phenomenon itself and its safety implications.

Chain-of-thought reasoning

Chain-of-thought prompting — asking a model to "think step by step" before giving its final answer — produces dramatically improved performance on multi-step reasoning tasks. But this improvement is emergent: it is essentially absent in models below approximately 100 billion parameters, and appears reliably only in models above that scale threshold. The same prompt that produces coherent step-by-step reasoning in a large model produces incoherent or misleading intermediate steps in a smaller model.

This is significant because chain-of-thought is not a capability that was explicitly trained for. Models were trained to predict text, not to reason step by step. The ability to produce useful reasoning chains emerged from scale and training data, not from explicit supervision of the reasoning process. This is the paradigm case of emergence in current AI: capabilities appear from scale alone, without having been specifically engineered.

Multi-step arithmetic

Performance on multi-step arithmetic tasks (additions and multiplications requiring more than a few operations) shows clear emergent patterns in the BIG-Bench evaluation suite. Below certain model scales, performance is at or below random chance for tasks requiring many steps. Above the threshold, models can solve these tasks reliably. The threshold is not a function of whether arithmetic was in the training data — it was — but of the model's capacity to perform multi-step computations reliably.

Language understanding without explicit multilingual training

Large models trained primarily on English text develop competence in other languages that were minority representations in the training data. At smaller scales, this transfer is minimal. At larger scales, the model has apparently learned general language representations sufficiently abstract that they transfer across languages. This represents an emergent generalization capability that was not explicitly trained for and could not have been predicted from the model's performance on English-only tasks at smaller scale.

Instruction following and in-context learning

The ability to follow explicit instructions without fine-tuning — to understand a prompt that says "translate the following to French" and act on it — is emergent in large models. Smaller models can recognize that the instruction is asking for translation but cannot reliably execute it. At larger scales, instruction following becomes reliable without any instruction-specific training. This is part of why GPT-3's few-shot in-context learning was seen as a qualitative leap: it demonstrated an emergent ability to understand and act on novel task specifications from context alone.

The BIG-Bench benchmark and emergent task evaluation

BIG-Bench (Beyond the Imitation Game Benchmark) is a collaborative benchmark developed by researchers across dozens of institutions specifically designed to evaluate large language models on tasks that current models cannot yet perform well — tasks at or beyond the frontier of model capability. It contains over 200 tasks spanning diverse domains and difficulty levels, with the explicit goal of identifying tasks that show emergent capability profiles.

BIG-Bench's design philosophy reflects the unique challenge of evaluating emergent capabilities: standard benchmarks saturate quickly as models improve, making them useless for tracking capability development at the frontier. BIG-Bench tasks were selected specifically to be difficult enough that current models performed poorly on them, ensuring that the benchmark would remain informative as models scaled.

Analysis of BIG-Bench results across model sizes revealed that a substantial minority of tasks show emergent profiles: near-flat performance below a scale threshold, followed by rapid improvement. Estimates of the exact fraction vary, but it is large enough to show that emergence is not confined to a few cherry-picked examples and is instead a systematic feature of how measured capabilities develop in large models.

Why emergence complicates safety

The safety implications of emergence are severe and specific. The standard safety engineering approach is: build a system, evaluate it for dangerous capabilities, deploy it only if evaluations are acceptable. This approach depends on the assumption that evaluations conducted before deployment reliably predict the system's capabilities during deployment.

Emergence breaks this assumption in a precise way. If a dangerous capability emerges only above some scale threshold, then evaluations conducted on sub-threshold models will not detect it — because it does not yet exist. A model that passes all safety evaluations at a given scale may develop dangerous capabilities at the next scale level. The evaluation passed is not evidence of safety at larger scale; it is only evidence of safety at the scale at which it was conducted.

This creates a genuinely hard problem for safety governance. Current responsible scaling policies (including Anthropic's Responsible Scaling Policy and OpenAI's Preparedness Framework) address this by conducting evaluations at each new scale level and establishing capability thresholds above which additional safety measures are required. But this approach is only as good as our ability to evaluate capabilities at each scale level. If the capability we are most worried about is emergent, and therefore absent at current scale, we cannot evaluate for it directly.
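The threshold-gating idea can be sketched in a few lines. Everything here is hypothetical: the `CapabilityThreshold` type, the `required_safeguards` function, the capability names, the scores, and the safeguard labels are invented for illustration and are not drawn from any published policy.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CapabilityThreshold:
    capability: str   # hypothetical capability name
    max_score: float  # evaluation score above which safeguards apply
    safeguard: str    # hypothetical safeguard label

def required_safeguards(eval_scores, thresholds):
    """Return (triggered safeguards, unevaluated capabilities).

    A capability with no score is reported as unevaluated rather than
    silently treated as safe.
    """
    triggered, unevaluated = [], []
    for t in thresholds:
        score = eval_scores.get(t.capability)
        if score is None:
            unevaluated.append(t.capability)
        elif score > t.max_score:
            triggered.append(t.safeguard)
    return triggered, unevaluated

thresholds = [
    CapabilityThreshold("capability_a", 0.20, "safeguard_level_2"),
    CapabilityThreshold("capability_b", 0.30, "safeguard_level_3"),
]

# Scores measured on the smaller model say nothing about the larger one,
# so the gate is re-run on every new model.
small_model_scores = {"capability_a": 0.01, "capability_b": 0.05}
large_model_scores = {"capability_a": 0.02, "capability_b": 0.45}

small_result = required_safeguards(small_model_scores, thresholds)
large_result = required_safeguards(large_model_scores, thresholds)
```

The design choice worth noting is that the gate takes only scores measured on the model in question. It can report what that model does, but it has no way to say what the next, larger model will do.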

The Evaluation Gap

The evaluation gap problem is one of the most important unsolved challenges in AI safety governance. If we cannot test for capabilities that models do not yet have, we cannot know in advance whether the next scale of deployment will cross a dangerous capability threshold. Interpretability research (Module 6) offers a partial answer: if we can identify the internal representations and circuits that underlie dangerous capabilities, we may be able to detect their formation before they are behaviorally manifest. But this approach remains research-stage. Current governance frameworks require evaluations on the model being deployed — which means emergent capabilities at the next scale level remain undetectable until they are in a deployed model.

The debate: is emergence real or a measurement artifact?

A significant scientific debate has emerged about whether the appearance of emergence is a genuine property of AI systems or a measurement artifact produced by discontinuous evaluation metrics. The central paper on this skeptical side is "Are Emergent Abilities of Large Language Models a Mirage?" (Schaeffer et al., 2023), from Stanford.

The argument is subtle and important. When we evaluate a model on a task, we use a metric. If the metric is discontinuous (for example, a binary "correct/incorrect" score that requires every sub-step of a reasoning chain to be correct), then even smooth, gradual underlying improvement can produce an apparent phase transition in the metric. If sub-steps succeed roughly independently, whole-task accuracy is approximately per-step accuracy raised to the power of the number of steps. On a ten-step task, a model that gets 50% of sub-steps right scores about 0.1% on the whole task, while a model that gets 80% of sub-steps right scores about 11%, roughly a hundredfold increase. The underlying capability (per-step accuracy) is improving smoothly; the metric (whole-task accuracy) appears to make a discontinuous jump.
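A short sketch of this effect, under two stated assumptions: steps succeed independently, and the task has ten steps. Both are simplifications chosen for illustration, and `exact_match_accuracy` is a hypothetical helper, not a function from any evaluation library.

```python
def exact_match_accuracy(per_step_accuracy, num_steps):
    """Probability that every step is correct, assuming independent steps."""
    return per_step_accuracy ** num_steps

num_steps = 10  # assumed chain length
per_step = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
whole_task = [exact_match_accuracy(p, num_steps) for p in per_step]

# The per-step column rises steadily; the whole-task column stays near
# zero for the first few rows and then climbs steeply.
for p, w in zip(per_step, whole_task):
    print(f"per-step {p:.2f} -> whole-task {w:.3f}")
```

Plotted against scale, the second column would look like a threshold effect even though the quantity driving it improves smoothly. Scoring per-step accuracy directly removes the apparent jump.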

Schaeffer et al. demonstrated that when continuous evaluation metrics are used instead of binary correct/incorrect metrics, many of the apparently emergent capabilities in the Wei et al. paper become smoother and more gradual. The phase transition in the metric does not necessarily correspond to a phase transition in the underlying capability.

The response from the emergence proponents has been that some emergent phenomena are genuine even under continuous metrics, and that the metric-artifact account does not fully explain all documented cases. The current scientific consensus is nuanced: some apparent emergence is metric-artifact, some is genuine, and the tools to reliably distinguish the two are still being developed.

Why This Debate Matters for Safety

If emergence is primarily a measurement artifact, the safety concern is partially mitigated: capabilities are developing more gradually than the apparent phase transitions suggest, giving evaluators more warning time. If some emergence is genuine, then some capabilities may genuinely appear suddenly and cannot be predicted from sub-threshold performance. The debate does not have a clear safety-favorable resolution: even if most emergence is artifact, a small number of genuinely emergent dangerous capabilities would be sufficient to create serious safety risks. The possibility of genuine emergence in at least some capability dimensions must be taken seriously by safety governance frameworks regardless of how the scientific debate resolves.

Implications for capability forecasting

Capability forecasting — predicting what AI systems will be able to do at future scale levels — is one of the most practically important and technically difficult problems in AI governance. Emergence complicates forecasting in fundamental ways.

Smooth scaling laws (the observation that loss decreases predictably with compute) describe how model performance on next-token prediction improves with scale. These scaling laws have been remarkably consistent and have enabled reasonable predictions of pre-training performance at future compute levels. But next-token prediction loss does not directly translate into downstream task performance, and it provides essentially no information about which capabilities will emerge at which scale thresholds.
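As a rough illustration of what a scaling-law forecast does and does not provide, the sketch below fits a pure power law, loss = a * compute^(-b), by least squares in log-log space. The functional form without an irreducible-loss term is a simplifying assumption, the data points are synthetic, and `fit_power_law` is a hypothetical helper.

```python
import math

def fit_power_law(compute, loss):
    """Fit loss = a * compute**(-b) by least squares on log-log values."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope

# Synthetic data generated from a known power law (not real measurements).
true_a, true_b = 20.0, 0.05
compute = [1e18, 1e19, 1e20, 1e21]
loss = [true_a * c ** (-true_b) for c in compute]

a, b = fit_power_law(compute, loss)
next_compute = 1e22
predicted_loss_next = a * next_compute ** (-b)

print(f"fitted exponent b = {b:.4f}")
print(f"predicted loss at {next_compute:.0e} = {predicted_loss_next:.3f}")
```

The output is a single number, the expected next-token loss at the next compute level. Nothing in the fit indicates which downstream tasks, if any, will cross a performance threshold at that loss.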

This means that even organizations with access to the best scaling law models face genuine uncertainty about what capabilities will be present in their next model. The gap between "we can predict the training loss" and "we can predict the capability profile" is large and currently unbridgeable. Organizations that have stated policies about not training models that would develop certain dangerous capabilities face the epistemically challenging task of making commitments about capabilities they cannot predict.

Implications for deployment decisions

Emergence affects deployment decisions through two distinct mechanisms. The first is the evaluation gap already discussed: deployed models may have capabilities that were not present at evaluation time. The second is post-deployment emergence: the behavior of a model in deployment may be different from its behavior in controlled evaluation, because deployment exposes the model to adversarial inputs, unusual contexts, and creative prompting strategies that evaluations did not anticipate.

Post-deployment emergence through adversarial elicitation is particularly concerning. There is substantial evidence that capabilities can be "unlocked" in deployed models through specific prompting strategies — jailbreaks, chain-of-thought prompts, specific instruction formats — that were not considered during evaluation. This means that an evaluation's finding that a model does not exhibit a dangerous capability may reflect the model's behavior under standard evaluation prompts rather than its behavior when adversarially probed.
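The gap between standard-prompt results and elicited results can be sketched as follows. The strategy names, scores, and threshold are invented for illustration, and `elicited_capability` is a hypothetical helper.

```python
def elicited_capability(scores_by_strategy):
    """Return the best-scoring prompting strategy and its score."""
    best = max(scores_by_strategy, key=scores_by_strategy.get)
    return best, scores_by_strategy[best]

# Made-up scores for one model on one capability evaluation.
scores = {
    "standard_prompt": 0.04,
    "few_shot": 0.11,
    "chain_of_thought": 0.37,
}
threshold = 0.25  # hypothetical level at which the capability is flagged

standard_only_flags = scores["standard_prompt"] > threshold
best_strategy, best_score = elicited_capability(scores)
elicited_flags = best_score > threshold

print(f"flagged under standard prompt only: {standard_only_flags}")
print(f"flagged under best strategy ({best_strategy}): {elicited_flags}")
```

Even the maximum over tested strategies is only a lower bound on what the model can do, since it covers just the elicitation techniques known at evaluation time.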

Responsible scaling policies address this by mandating red-teaming — systematic adversarial probing — alongside standard evaluations. But red-teaming is limited by the creativity of the red-teamers and the adversarial techniques available at evaluation time. Novel elicitation techniques that emerge after deployment cannot be anticipated in pre-deployment evaluations.

Emergence and AI development strategy

The existence of emergence — or even the serious possibility of it — has significant implications for how AI development should be structured and governed. Several conclusions follow from taking emergence seriously:

  • Pre-registration of safety evaluations: Organizations should specify in advance which capability thresholds would trigger what safety responses, rather than making these decisions post-hoc after evaluations have been conducted. This prevents motivated reasoning from influencing the interpretation of evaluation results.
  • Mandatory evaluations at each scale increment: Responsible scaling policies (discussed in Module 10) are based on the recognition that emergent capabilities require evaluation at each new scale level, not just at initial deployment.
  • Investment in interpretability research: The only current path to predicting emergent capabilities before they appear is to understand what internal representations and circuits underlie those capabilities, and to track their development through training. This requires the interpretability tools discussed in Module 6.
  • Conservative deployment thresholds: Given the uncertainty about when dangerous capabilities might emerge, emergence strengthens the precautionary case for conservative deployment thresholds, meaning thresholds that require evidence of safety rather than merely an absence of evidence of danger. The asymmetry between the costs of excessive caution (capability delayed) and insufficient caution (dangerous capability deployed) should inform where the burden of proof lies.

Emergence as Evidence for Safety Investment

The existence of emergent capabilities is sometimes presented as an argument against optimism about AI safety — if capabilities appear without warning, how can safety be maintained? The better reading is the opposite: emergence makes the case for early and sustained investment in safety research more compelling, not less. The time to develop interpretability tools, evaluation methods, and governance frameworks is before dangerous capabilities emerge, not after. An AI safety field that is well-resourced and well-developed before dangerous capabilities appear is in a dramatically better position than one that scrambles to respond after the fact. Emergence is thus an argument for investment in safety research now, while the window is open.