Module 1 · Expert Track · 22 min read · AI Safety and Alignment

What Is AI Safety and Why It Matters

AI safety is the field dedicated to ensuring that artificial intelligence systems behave in ways that are beneficial, reliable, and aligned with human intentions — not merely in controlled test environments, but robustly across the full range of situations they will actually encounter. Understanding what AI safety is, and why it is technically hard, is the necessary foundation for everything that follows in this course.

Defining the field: safety, ethics, and governance

The terms AI safety, AI ethics, and AI governance are frequently conflated in public discourse, but they refer to meaningfully different bodies of work with different methods, timescales, and communities of practice.

AI ethics is the oldest and broadest of these fields, concerned with the moral dimensions of AI development and deployment. Questions of bias, fairness, transparency, accountability, and the distribution of AI's benefits and harms fall under its umbrella. AI ethics draws heavily from philosophy, social science, and law, and its practitioners typically focus on systems that exist today — facial recognition software that performs unequally across demographic groups, hiring algorithms that perpetuate historical discrimination, recommendation systems that amplify outrage. These are real problems causing real harm now, and they demand serious attention.

AI governance refers to the policy, regulatory, and institutional frameworks that societies use to oversee AI development and deployment. It includes questions about liability, intellectual property, standards bodies, procurement rules, and international agreements. Governance work happens largely in legal, regulatory, and political domains and is concerned with structuring incentives and accountability mechanisms at a societal scale.

AI safety is more specifically technical. It refers to the challenge of building AI systems that reliably do what their designers intend, remain under appropriate human control, and do not cause unintended harm — particularly as AI systems become more capable and autonomous. Safety research is concerned both with near-term issues (AI systems that fail in dangerous ways, that behave unexpectedly under distribution shift, that are vulnerable to adversarial attack) and with longer-term challenges that arise as AI systems approach and potentially exceed human-level performance across a broad range of tasks.

A useful distinction

Think of it this way: AI ethics asks "Is this AI system fair and just?" AI governance asks "How should society regulate AI?" AI safety asks "Will this AI system actually do what we want, reliably, even in novel situations and at high levels of capability?" The three questions overlap and inform each other, but they are distinct research programs.

The spectrum from near-term to long-term risk

One reason AI safety discussions can be confusing is that the field spans an enormous temporal and severity range. Practitioners differ significantly on which parts of this spectrum deserve the most attention, and those disagreements are substantive, not merely rhetorical.

At the near-term end of the spectrum sit risks that are already materializing: AI systems that produce confident misinformation, automated weapons systems with inadequate human oversight, AI-generated content used for fraud and manipulation, biometric surveillance enabling authoritarian control, and algorithmic systems that perpetuate discrimination in consequential decisions. These risks do not require speculative assumptions about future capabilities — they are documented phenomena in deployed systems today.

Moving toward the medium-term, researchers are concerned about AI systems that are increasingly capable but whose goals and values are poorly understood. A system powerful enough to significantly accelerate scientific research, manage critical infrastructure, or conduct sophisticated cyberoperations — but whose behavior in novel situations cannot be predicted with confidence — poses risks that are qualitatively different from a biased hiring algorithm. The asymmetry between capability and reliability becomes more dangerous as the stakes of deployment increase.

At the long-term end — the subject of the most heated debate — sit concerns about AI systems that might eventually exceed human capabilities across virtually all cognitively demanding tasks. Whether and when such systems might emerge is deeply uncertain, but if they did emerge before we had solved the fundamental problem of ensuring their goals aligned with human values, the consequences could be severe and difficult to reverse. This is the domain of what researchers sometimes call transformative AI risk or existential risk from AI.

Near-term safety: robustness and reliability

Current AI systems fail in well-documented ways under distribution shift, adversarial inputs, and deployment conditions that differ from training. Making systems reliably safe in the real world, not just on test sets, is an active and important research problem.

Medium-term safety: oversight and control

As AI systems become more capable and autonomous, maintaining meaningful human oversight becomes both more important and more technically challenging. Research on corrigibility, interpretability, and value learning addresses this horizon.

Long-term safety: alignment at high capability

If AI systems reach or exceed human-level capability before we have solved the alignment problem, the consequences could be catastrophic and irreversible. This possibility motivates a significant portion of safety research even though its probability and timeline are genuinely uncertain.

Why safety is technically hard

The difficulty of AI safety is not primarily a matter of political will or corporate irresponsibility, though both of those matter. It is fundamentally a hard technical problem. Several interconnected challenges, none of them fully solved, make it difficult to build reliably safe AI systems.

The first challenge is the specification problem: it is very difficult to formally specify what we actually want an AI system to do. Human values are complex, contextual, and often contradictory. When we try to capture them in a reward function or a training objective, we almost inevitably create a specification that is imperfect — one that matches human values in the situations we anticipated but diverges from them in situations we did not. We will explore this in depth in Module 2.
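As a minimal sketch of how an imperfect specification can diverge from what the designer wanted, consider a toy cleaning agent. Everything here is hypothetical and chosen only for illustration: the Outcome fields, the three actions, and the effort costs. The proxy reward can only observe whether mess is visible, while the designer cares whether mess is actually present.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    mess_visible: bool  # what the specified reward can observe
    mess_present: bool  # what the designer actually cares about
    effort: float       # cost the agent pays to reach this outcome


def proxy_reward(o: Outcome) -> float:
    """Specified objective: reward when no mess is visible, minus effort."""
    return (0.0 if o.mess_visible else 1.0) - o.effort


def intended_value(o: Outcome) -> float:
    """What the designer wanted: reward when no mess exists at all."""
    return 0.0 if o.mess_present else 1.0


# Hypothetical action set. The designer anticipated only the first two.
ACTIONS = {
    "do_nothing": Outcome(mess_visible=True, mess_present=True, effort=0.0),
    "clean_up": Outcome(mess_visible=False, mess_present=False, effort=0.3),
    "cover_with_rug": Outcome(mess_visible=False, mess_present=True, effort=0.1),
}


def best_action(score, actions):
    """Return the name of the action that maximizes the given scoring function."""
    return max(actions, key=lambda name: score(actions[name]))


anticipated = {name: ACTIONS[name] for name in ("do_nothing", "clean_up")}

print("proxy, anticipated actions:", best_action(proxy_reward, anticipated))
print("proxy, all actions:        ", best_action(proxy_reward, ACTIONS))
print("intended, all actions:     ", best_action(intended_value, ACTIONS))
```

On the two anticipated actions the proxy and the intended objective agree. Once the unanticipated action is available, maximizing the proxy selects it, even though it scores zero under the intended objective.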

The second challenge is distributional shift: AI systems are trained on data from some distribution of situations, but they are deployed in environments that may differ from that training distribution in unpredictable ways. A system that behaves safely in its training environment may behave unsafely when it encounters genuinely novel situations — exactly the situations where reliable behavior matters most.
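The effect can be illustrated with a deliberately simple sketch. The setup is an assumption made for illustration, not a model of any real system: a one-dimensional threshold classifier is fitted on two Gaussian classes, then evaluated on data where every input has been offset by a constant, as if a sensor had drifted after deployment.

```python
import random


def sample(n, shift, rng):
    """Draw n labelled points; class 0 is centred at shift, class 1 at 4 + shift."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(4.0 * label + shift, 1.0)
        data.append((x, label))
    return data


def fit_threshold(data):
    """Minimal 'training': place the decision threshold midway between class means."""
    xs0 = [x for x, y in data if y == 0]
    xs1 = [x for x, y in data if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2


def accuracy(threshold, data):
    """Fraction of points where 'x above threshold' matches the label."""
    correct = sum((x > threshold) == bool(y) for x, y in data)
    return correct / len(data)


rng = random.Random(0)  # fixed seed so the run is repeatable
threshold = fit_threshold(sample(2000, shift=0.0, rng=rng))

in_dist_acc = accuracy(threshold, sample(2000, shift=0.0, rng=rng))
shifted_acc = accuracy(threshold, sample(2000, shift=3.0, rng=rng))

print(f"threshold learned in training:  {threshold:.2f}")
print(f"accuracy on training-like data: {in_dist_acc:.2%}")
print(f"accuracy after input shift:     {shifted_acc:.2%}")
```

The classifier itself is unchanged between the two evaluations. Only the data moved, and nothing in the model signals that its accuracy has degraded.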

The third challenge is interpretability: we do not understand in detail how modern large neural networks represent information and make decisions. We cannot look at a model's weights and verify that it has the values we intended to instill, or predict how it will behave in situations it has not yet encountered. Safety claims about AI systems are therefore necessarily somewhat empirical — based on testing rather than formal verification — and may miss failure modes that testing has not revealed.
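The gap between testing and verification can be made concrete with a small sketch. The system_under_test function below is a hypothetical stand-in for an opaque model, and the single TRIGGER input and the size of the input space are arbitrary assumptions. The point is only that black-box sampling can pass while a failure mode still exists.

```python
import random

INPUT_SPACE = 10**9      # assumed number of distinct inputs
TRIGGER = 123_456_789    # hypothetical rare input on which the system misbehaves


def system_under_test(x: int) -> str:
    """Stand-in for an opaque model: safe everywhere except one input."""
    return "unsafe" if x == TRIGGER else "safe"


def random_test(n_tests: int, seed: int = 0) -> bool:
    """Return True if every uniformly sampled input produced a safe output."""
    rng = random.Random(seed)
    return all(
        system_under_test(rng.randrange(INPUT_SPACE)) == "safe"
        for _ in range(n_tests)
    )


def miss_probability(n_tests: int) -> float:
    """Chance that n uniform random tests all miss the single bad input."""
    return (1 - 1 / INPUT_SPACE) ** n_tests


passed = random_test(10_000)
print("all sampled tests passed:", passed)
print(f"probability of missing the failure: {miss_probability(10_000):.6f}")
print("behaviour on the untested trigger:", system_under_test(TRIGGER))
```

A passing test suite here says little about the trigger input. Without a way to inspect how the system computes its answer, the only evidence available is the sample of inputs that happened to be tried.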

The fourth challenge is scalable oversight: as AI systems become more capable, the task of verifying their outputs and maintaining meaningful oversight becomes harder. If an AI system is smarter than its overseers at some task, those overseers may be unable to catch errors or detect misalignment. The methods we use to supervise AI systems today may not scale to the systems we are building toward.

The capability-safety gap

AI capabilities and AI safety are not developing at the same pace. Capabilities research — making AI systems smarter, faster, and more general — benefits from direct feedback: you know immediately whether the system performs better on the benchmark. Safety research is harder to evaluate: you rarely know definitively whether a system is safe until it fails in a costly way. This asymmetry in feedback creates institutional pressure that tends to favor capabilities over safety.

Key researchers and their contributions

AI safety as a distinct research field has developed primarily over the last two decades, though its conceptual roots trace back further. Several researchers have made foundational contributions that shaped how the field thinks about its core problems.

Stuart Russell, a professor at UC Berkeley and co-author with Peter Norvig of the widely used textbook Artificial Intelligence: A Modern Approach, has articulated one of the most influential frameworks for thinking about the alignment problem. In his 2019 book Human Compatible, Russell argues that the standard model of AI, in which we program a fixed objective for the AI to maximize, is fundamentally flawed. He proposes instead that AI systems should be uncertain about human preferences and designed to defer to humans when uncertain. His work on cooperative inverse reinforcement learning (CIRL), also known as assistance games, formalizes this intuition.
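The intuition that an uncertain system does better by deferring can be sketched numerically. The example below is a simplified, hypothetical version of the off-switch game analysed by Hadfield-Menell, Dragan, Abbeel, and Russell. It assumes the human knows the true utility of the proposed action and vetoes it exactly when that utility is negative. The belief values are made up for illustration.

```python
def expected_value(belief, transform=lambda u: u):
    """Expectation of transform(u) under a discrete belief [(utility, prob), ...]."""
    return sum(p * transform(u) for u, p in belief)


def option_values(belief):
    """Value of each option from the AI system's point of view."""
    return {
        # Act immediately: receive whatever the true utility turns out to be.
        "act_without_asking": expected_value(belief),
        # Switch itself off: nothing happens, utility zero.
        "switch_off": 0.0,
        # Defer: the human allows the action only when its utility is positive.
        "defer_to_human": expected_value(belief, lambda u: max(u, 0.0)),
    }


# The system is unsure whether its proposed action helps or harms.
uncertain = [(-2.0, 0.3), (0.5, 0.3), (3.0, 0.4)]
# Same expected utility, but no uncertainty about it.
certain = [(0.75, 1.0)]

for name, belief in (("uncertain", uncertain), ("certain", certain)):
    values = option_values(belief)
    print(name, {k: round(v, 3) for k, v in values.items()})
```

Under these assumptions deferring is never worse than acting or switching off, and it is strictly better whenever the belief puts weight on both positive and negative utilities. When the system is certain, deferring gains nothing, which is why the fixed-objective design gives no incentive to accept human oversight. The result depends on the assumption that the human's veto is reliable.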

Paul Christiano is one of the most technically influential safety researchers of his generation. While at OpenAI, he developed the theoretical foundations of reinforcement learning from human feedback (RLHF), which became the dominant approach to aligning large language models. His research on amplification and debate as methods for scalable oversight has shaped the field's thinking about how to maintain alignment as systems grow more capable. He subsequently founded the Alignment Research Center (ARC).

Jan Leike led the alignment team at OpenAI and has been central to operationalizing alignment research — translating theoretical concerns about alignment into tractable research programs and practical interventions in deployed systems. His work helped establish RLHF as a practical technique and has consistently pushed the field toward empirical, measurable progress.

Yoshua Bengio is one of the founding figures of modern deep learning and a Turing Award recipient. His engagement with AI safety and existential risk, which became more prominent and public in 2023, has been significant in legitimizing these concerns within the mainstream machine learning community. Bengio has argued that the risks from advanced AI are serious enough to warrant substantial changes in how the field operates — a position that carries weight given his academic stature.

Key organizations

The institutional landscape of AI safety has evolved significantly and now includes both independent research organizations and safety teams embedded within AI labs.

Anthropic was founded in 2021 by Dario Amodei, Daniela Amodei, and others who departed OpenAI, with an explicit mission centered on AI safety. Anthropic publishes substantial safety research — including Constitutional AI, work on interpretability, and Claude's model card — and its stated goal is to be at the frontier of AI capabilities specifically in order to do safety research on frontier models. Anthropic's approach is notable for integrating safety research deeply into its model development rather than treating it as a separate compliance function.

The Alignment Research Center (ARC), founded by Paul Christiano, focuses on making progress on what it considers the hardest technical aspects of the alignment problem, particularly scalable oversight and evaluation of dangerous capabilities. ARC's Evals project, which has since spun out as the independent organization METR, developed methods for evaluating whether AI systems have dangerous capabilities, and collaborated with major labs on pre-deployment evaluations.

The Machine Intelligence Research Institute (MIRI), founded in 2000, has focused on the longer-term theoretical challenges of building provably aligned AI systems. MIRI's early work on AI risk scenarios and its research on agent foundations have shaped the conceptual vocabulary of the field, though its approach — emphasizing mathematical rigor and formal guarantees — has been more influential conceptually than technically in the era of large neural networks.

DeepMind Safety (now part of Google DeepMind) has produced important research on reward modeling, specification gaming, and scalable oversight. Separately, the 2016 paper Concrete Problems in AI Safety, authored by Amodei, Olah, Steinhardt, Christiano, Schulman, and Mané (then at Google Brain, Stanford, UC Berkeley, and OpenAI, not DeepMind), was instrumental in establishing a research agenda for the field and bringing safety concerns into the mainstream ML community.

The lab safety dilemma

There is an ongoing debate about whether AI safety research is best pursued inside frontier AI labs or at independent organizations. Labs have access to frontier models and the ability to directly influence what gets deployed. Independent organizations can pursue research without commercial pressure. Both models have produced important work, and both are necessary parts of a healthy research ecosystem.

What different communities mean by "AI safety"

One persistent source of confusion in discussions of AI safety is that the term means substantially different things to different communities. Understanding these differences is essential for engaging productively with debates in this space.

In the machine learning research community, "safety" often refers specifically to robustness — ensuring systems behave reliably under distribution shift, adversarial attack, and deployment conditions different from training. This is sometimes called "technical AI safety" in a narrow sense. Researchers working in this tradition are often skeptical of claims about longer-term existential risk, viewing them as speculative and a distraction from tractable near-term problems.

In the AI alignment community — influenced by researchers like Eliezer Yudkowsky, Paul Christiano, and others — "safety" refers primarily to the challenge of ensuring that very capable AI systems pursue goals that are actually beneficial to humanity, rather than goals that appear beneficial during training but diverge in deployment. This community tends to take long-term existential risk more seriously and places emphasis on theoretical foundations and on solving problems before they arise.

In the AI ethics and fairness community, "AI safety" often encompasses harm reduction for currently deployed systems: preventing discrimination, protecting privacy, ensuring accountability, and distributing AI's benefits equitably. Members of this community are typically skeptical of long-term risk framings, which they sometimes view as distracting attention and resources from immediate, concrete harms being experienced by marginalized communities now.

In public policy circles, "AI safety" often means regulatory compliance, liability frameworks, and governance structures — a focus shaped by what is politically tractable and what existing legal frameworks can address.

Productive disagreement

These different meanings are not merely semantic confusion — they reflect genuine disagreements about where the most important risks lie, what timescales matter, and what kinds of solutions are tractable. The most productive researchers and policymakers in this space are able to engage seriously with perspectives they disagree with, rather than dismissing competing framings as simply wrong. This course aims to represent the full landscape of views fairly.

Why safety is a prerequisite for beneficial AI

One might ask: why is safety research necessary? If AI systems are trained to be helpful, and the organizations building them are well-intentioned, won't that be sufficient? The history of technology and the early history of AI deployment suggest the answer is no, for several reasons.

First, intent is not sufficient for safety. Well-intentioned designers can create systems with unintended failure modes. The history of software engineering, aviation, pharmaceutical development, and nuclear power is a history of catastrophic failures that nobody intended — failures that arose from the gap between designers' models of how systems behave and how they actually behave in complex, real-world conditions.

Second, competitive pressure creates incentives that can override good intentions. Organizations that slow down to invest in safety may lose market share to competitors who do not. Without shared norms, standards, and regulatory frameworks, safety investment is a competitive disadvantage — which is one of the core arguments for governance frameworks that level the playing field.

Third, the problem of ensuring that AI systems behave well in novel situations is genuinely unsolved. We do not yet have reliable methods for verifying that a trained AI system will behave safely across the full distribution of situations it might encounter. This is not a matter of insufficient effort; it is a research frontier with deep technical challenges that require serious investigation.

The field of AI safety exists because researchers recognized that making AI systems truly beneficial — not just useful in test environments, but reliably beneficial across the diverse and unpredictable conditions of real deployment — requires dedicated research and systematic effort. The technical challenges are real, the stakes are high, and the work is genuinely difficult. The remaining modules of this course examine these challenges in depth.