Module 10 · Expert Track · 28 min read · AI Safety and Alignment

Paths to Safe AI

This final module is about tractability. The preceding nine modules have established that AI safety is a genuine, serious, and multidimensional challenge. This one addresses the more important question: what are we actually doing about it, and does it have a realistic chance of working? The answer — carefully hedged — is yes. The technical problems are hard but not obviously unsolvable. The governance challenges are real but not unprecedented. And the window in which decisive investments in safety can be made is open now, though it will not remain open indefinitely.

The technical agenda

The technical path to safe AI is not a single research program but a portfolio of complementary approaches, each addressing different aspects of the alignment and safety problems. No single technique is sufficient on its own; the case for tractability rests on the portfolio working together.

Interpretability research

The interpretability agenda (covered in detail in Module 6) is the most fundamental technical approach to AI safety. The core insight is that we cannot safely deploy systems whose internal operations are opaque. If we cannot determine what a model is actually optimizing for, what representations it has formed, or whether its apparent alignment is genuine or instrumental, we are deploying blindly. Interpretability research aims to change this by developing tools to understand model internals well enough to catch misalignment before it causes harm.

The current state of interpretability research is promising but incomplete. Mechanistic interpretability has produced significant results: the identification of circuits implementing specific capabilities, the superposition hypothesis explaining how models can represent many more features than they have dimensions, and the discovery of specific neurons implementing interpretable features. Anthropic's research has used sparse autoencoders to decompose activations in the residual stream of Claude models into interpretable features that correspond to specific concepts, something individual neurons often fail to expose cleanly because of superposition. But scaling these tools to models with hundreds of billions of parameters, and developing the understanding needed to reliably detect subtle misalignment, remains a major open challenge.
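To make the sparse autoencoder idea concrete, here is a minimal sketch of its forward pass and training objective in plain Python. It is an illustration under stated assumptions, not Anthropic's implementation: the function names, the tiny hand-set weights, and the L1 coefficient are all hypothetical, and a real system would learn the weights by gradient descent over a very large set of recorded activations.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sae_forward(
    x: Vector, w_enc: Matrix, b_enc: Vector, w_dec: Matrix, b_dec: Vector
) -> Tuple[Vector, Vector]:
    """Encode an activation into non-negative features, then reconstruct it."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre_acts = matvec(w_enc, centered)
    # ReLU keeps features non-negative; most should end up exactly zero.
    features = [max(0.0, p + b) for p, b in zip(pre_acts, b_enc)]
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon


def sae_loss(x: Vector, features: Vector, recon: Vector, l1_coeff: float) -> float:
    """Squared reconstruction error plus an L1 penalty that rewards sparsity."""
    squared_error = sum((xi - ri) ** 2 for xi, ri in zip(x, recon))
    sparsity = sum(abs(f) for f in features)
    return squared_error + l1_coeff * sparsity


# Hypothetical toy setup: a 2-dimensional activation, 3 candidate features.
# More features than dimensions mirrors the superposition picture.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # 3 features x 2 dims
W_DEC = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]     # 2 dims x 3 features
B_ENC = [0.0, 0.0, 0.0]
B_DEC = [0.0, 0.0]

activation = [2.0, 0.0]
features, recon = sae_forward(activation, W_ENC, B_ENC, W_DEC, B_DEC)
loss = sae_loss(activation, features, recon, l1_coeff=0.1)
```

In this toy case only the first of the three features fires, and the activation is reconstructed exactly, so the loss consists entirely of the sparsity penalty. Interpretability work then asks what real-world concept each learned feature tracks, which is the part no code sketch can show.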

Scalable oversight

Scalable oversight (mentioned in earlier modules) addresses the problem that as AI systems become more capable, human evaluators may lose the ability to directly assess whether AI outputs are correct and aligned. If a superintelligent AI produces a proof of a major mathematical theorem, verifying it may be beyond any human's individual ability. If an AI system designs a novel drug, evaluating whether the design is actually optimal may require more domain expertise than any human reviewer possesses.

Two specific techniques have attracted the most research attention. In debate, proposed by Geoffrey Irving and Paul Christiano at OpenAI in 2018, two AI systems argue for different claims and a human judge evaluates the quality of the arguments rather than trying to assess the conclusions directly. The key insight is that verifying a proof of correctness can be easier than generating one: even a human who cannot independently solve a problem may be able to evaluate which of two competing arguments is stronger. Amplification, also developed by Christiano and colleagues, bootstraps human oversight by having humans use AI assistance to evaluate AI outputs: the human asks the AI system to help them understand the AI's own outputs, using the AI's knowledge to compensate for the human's limitations.
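The asymmetry between verifying and generating can be sketched with a toy debate. This is an analogy, not the protocol from the 2018 proposal: the question ("is n prime?"), the debater functions, and the judge are all hypothetical stand-ins. The judge never searches for factors itself; it only performs the cheap check of whether a factor offered by a debater actually divides n.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Argument:
    side: str                # "A" or "B"
    claim: str               # "prime" or "composite"
    witness: Optional[int]   # checkable evidence, if the debater offers any


Debater = Callable[[int, List[Argument]], Argument]
Judge = Callable[[int, List[Argument]], Optional[str]]


def honest_debater(n: int, transcript: List[Argument]) -> Argument:
    """Does the expensive search and reports what it finds."""
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            return Argument("A", "composite", d)
    return Argument("A", "prime", None)


def contrarian_debater(n: int, transcript: List[Argument]) -> Argument:
    """Always argues against the previous speaker, with bogus evidence if needed."""
    if transcript[-1].claim == "composite":
        return Argument("B", "prime", None)
    return Argument("B", "composite", n - 1)


def cheap_judge(n: int, transcript: List[Argument]) -> Optional[str]:
    """Verifies offered witnesses with one division each; never searches."""
    for arg in transcript:
        w = arg.witness
        if arg.claim == "composite" and w is not None and 1 < w < n and n % w == 0:
            return arg.side
    return None  # nothing the judge could verify


def run_debate(
    n: int, debater_a: Debater, debater_b: Debater, judge: Judge, rounds: int = 1
) -> Optional[str]:
    transcript: List[Argument] = []
    for _ in range(rounds):
        transcript.append(debater_a(n, transcript))
        transcript.append(debater_b(n, transcript))
    return judge(n, transcript)


winner_91 = run_debate(91, honest_debater, contrarian_debater, cheap_judge)
winner_97 = run_debate(97, honest_debater, contrarian_debater, cheap_judge)
```

For 91 the honest debater wins because its witness (7) is trivially checkable. For 97, which is prime, the honest debater has no short witness to offer and this judge cannot decide, which is a small-scale version of the worry that debate only helps when the judge can reliably evaluate the arguments put in front of it.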

Both approaches face significant challenges. Debate only works if the human judge can reliably evaluate argument quality, which may not hold for highly technical or subtle questions. Amplification may not scale if the AI system being used for assistance is also potentially misaligned. Research in scalable oversight is active and has produced empirical results suggesting these concerns are not immediately fatal, but neither approach has been demonstrated at the scale where it would be most needed.

Automated alignment research

One of the most discussed and controversial ideas in AI safety is the possibility of using AI systems to assist in solving the alignment problem — what is sometimes called "using AI to align AI." The intuition is that the alignment problem may be too technically difficult for humans to solve unaided, and that AI systems with relevant capabilities might provide the research assistance necessary to make progress.

Anthropic's "Core Views" document and OpenAI's superalignment initiative both explicitly discuss the goal of developing AI systems capable of conducting alignment research, potentially accelerating the pace of progress enough to stay ahead of the capability curve. The superalignment team at OpenAI, before its dissolution in 2024, studied whether weaker models can supervise stronger ones (in its published experiments, a GPT-2-level model supervising GPT-4) as an empirical analogue of humans overseeing more capable systems, and so as a path toward scalable oversight.

The challenge — which Anthropic researchers have articulated clearly — is circular: using a potentially misaligned AI to help align other AI systems requires that the assisting AI system is itself sufficiently aligned to be trusted with that role. This circularity is not necessarily fatal if it can be bootstrapped: starting with a smaller, more interpretable model that can be verified as aligned, and using it to help align slightly more capable models, and iterating. But the bootstrap path requires that each step be reliably verified, which requires the interpretability tools that are themselves under development.
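The bootstrapping argument has a simple control-flow shape, sketched below. Everything here is a hypothetical stand-in: the `Model` record, the numeric capability levels, and both verification functions are placeholders for interpretability-based checks that do not yet exist at the required reliability. The sketch only shows the structural point that the chain of trust stops at the first step that cannot be verified.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Model:
    name: str
    capability: int  # hypothetical scalar; real capability is not one number


def bootstrap_trust(
    models: List[Model],
    verify_base: Callable[[Model], bool],
    verify_with_helper: Callable[[Model, Model], bool],
) -> List[Model]:
    """Return the prefix of models (weakest first) that could be verified.

    The first model must pass direct verification. Each later model must be
    verified with the help of the most capable model trusted so far. The
    chain halts at the first failure: nothing beyond it is trusted.
    """
    trusted: List[Model] = []
    if not models or not verify_base(models[0]):
        return trusted
    trusted.append(models[0])
    for candidate in models[1:]:
        if not verify_with_helper(trusted[-1], candidate):
            break
        trusted.append(candidate)
    return trusted


# Placeholder checks (assumptions for illustration only).
def small_enough_to_inspect(model: Model) -> bool:
    return model.capability <= 1


def helper_can_verify(helper: Model, candidate: Model) -> bool:
    # Assume a helper can only vouch for a model one step beyond itself.
    return candidate.capability - helper.capability <= 1


lineup = [Model("m1", 1), Model("m2", 2), Model("m3", 3), Model("m5", 5), Model("m6", 6)]
trusted_names = [m.name for m in bootstrap_trust(lineup, small_enough_to_inspect, helper_can_verify)]
```

With these placeholder checks the chain reaches "m3" and stops, because the jump to "m5" is larger than the helper can vouch for, and "m6" is never reached even though it is only one step beyond "m5". The practical difficulty is that real verification functions are far less clean than these stand-ins.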

The Research Interdependence Problem

A recurring theme across the technical safety agenda is that the tools needed to verify each approach depend on other tools that are themselves under development. Scalable oversight depends on interpretability to verify that AI assistants are providing genuine help rather than misleading the human supervisor. Automated alignment research depends on scalable oversight to verify the AI's outputs. Interpretability research is trying to develop the foundational tools but has not yet reached the scale where the most important applications become possible. This interdependence is not a reason for pessimism — it accurately describes the state of a young field with genuine momentum — but it is a reason to be clear-eyed about what remains unsolved.

The governance agenda

Technical solutions are necessary but not sufficient for safe AI. Even if interpretability research provides reliable tools for verifying AI alignment, those tools will only be used if the incentive structures and governance frameworks require their use. The governance agenda addresses this: what policies, institutions, and international agreements are needed to ensure that AI development proceeds safely?

International coordination

The most important governance gap identified in Module 8 is the absence of international coordination on AI safety. The historical analogy most often cited is the nuclear nonproliferation framework: the Nuclear Non-Proliferation Treaty, the International Atomic Energy Agency's inspection regime, and the network of export controls and bilateral agreements that collectively govern nuclear materials and weapons development.

AI governance scholars including Allan Dafoe, Helen Toner, and the team at the Centre for the Governance of AI at Oxford have sketched frameworks for analogous AI governance structures. The core elements of these proposals are an international technical body (analogous to the IAEA) with the authority to inspect frontier AI training runs and the access needed to evaluate deployed systems, a treaty framework establishing minimum safety standards for frontier AI development, and international compute governance mechanisms (building on the export control infrastructure discussed in Module 8) coordinated across chip-manufacturing and chip-consuming nations.

The obstacles are formidable. The nuclear nonproliferation framework took decades to develop and has significant gaps (several states have developed nuclear weapons outside the treaty framework). AI development is more distributed than nuclear development — the underlying technology is not as exclusively tied to rare physical materials — which makes containment harder. Geopolitical competition, particularly US-China strategic rivalry over AI dominance, creates strong incentives against meaningful coordination even when both parties might benefit from it. And the pace of AI development may outrun the pace of treaty negotiation.

Compute governance

As discussed in Module 8, compute governance — controlling access to the hardware necessary for frontier AI training — is potentially the most tractable near-term lever for international AI governance. The semiconductor supply chain is concentrated enough that a coordinated export control regime could be meaningfully enforced, at least for the most capable chips and the most advanced manufacturing equipment.

The Semiconductor Industry and Supply Chain Management (SISCM) proposals, developed by researchers including Tim Fist and Tristan Croll, would require cryptographic reporting mechanisms built into high-performance AI chips so that governments can monitor large training runs. The idea is that each chip would record and report the existence (not the content) of training runs above specified thresholds, providing governments with visibility into frontier AI development that current voluntary reporting frameworks cannot guarantee. This "know your compute" approach mirrors the "know your customer" framework in financial regulation and is a technically specific proposal for how international compute monitoring could work.
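The threshold logic behind such reporting is simple arithmetic, sketched below. This is not the mechanism from any specific proposal: the threshold value, the per-chip throughput, the utilization figure, and all names are illustrative assumptions, and the cryptographic attestation that would make a report trustworthy is omitted entirely. The sketch shows only how a run's total compute might be estimated and how a report can carry the existence and scale of a run without any of its content.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical reporting threshold in total floating-point operations.
REPORTING_THRESHOLD_FLOP = 1e26

SECONDS_PER_DAY = 24 * 3600


@dataclass(frozen=True)
class TrainingRun:
    num_chips: int
    peak_flop_per_sec_per_chip: float  # assumed hardware peak
    utilization: float                 # fraction of peak actually achieved
    duration_seconds: float


@dataclass(frozen=True)
class RunReport:
    """Existence-only report: scale of the run, nothing about data or weights."""
    estimated_flop: float
    num_chips: int


def estimate_training_flop(run: TrainingRun) -> float:
    """Total compute = chips x peak throughput x utilization x time."""
    return (
        run.num_chips
        * run.peak_flop_per_sec_per_chip
        * run.utilization
        * run.duration_seconds
    )


def maybe_report(
    run: TrainingRun, threshold: float = REPORTING_THRESHOLD_FLOP
) -> Optional[RunReport]:
    """Return a report only when the run meets or exceeds the threshold."""
    flop = estimate_training_flop(run)
    if flop < threshold:
        return None
    return RunReport(estimated_flop=flop, num_chips=run.num_chips)


# Two hypothetical 90-day runs on chips assumed to peak at 1e15 FLOP/s.
small_run = TrainingRun(10_000, 1e15, 0.4, 90 * SECONDS_PER_DAY)
large_run = TrainingRun(40_000, 1e15, 0.4, 90 * SECONDS_PER_DAY)

small_report = maybe_report(small_run)
large_report = maybe_report(large_run)
```

Under these assumed numbers the 10,000-chip run comes to roughly 3.1e25 operations and stays below the threshold, while the 40,000-chip run comes to roughly 1.2e26 and triggers a report. Where the threshold sits, and how utilization is measured, are policy choices this sketch does not settle.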

Third-party audits and certification

The pharmaceutical and aviation industries have developed robust frameworks for third-party safety assessment: independent organizations evaluate products against specified safety standards before they can be deployed. AI safety governance increasingly recognizes that analogous frameworks are needed for frontier AI systems. The EU AI Act's conformity assessment requirements and the UK AISI's pre-deployment evaluations are the most advanced current implementations.

For third-party audit frameworks to be effective, three things are required. First, auditors need access to the models being evaluated — both behavioral access (the ability to run any prompt against the model) and weight access (the ability to run mechanistic interpretability analyses on the model's internal representations). Second, auditors need agreed evaluation standards — the set of tests that constitute a "pass" on safety requirements — which requires technical consensus about what dangerous capabilities look like. Third, the legal framework needs to give auditor findings meaningful consequence — the ability to delay or prevent deployment of models that fail safety assessments. Current frameworks are making progress on the first two but struggle with the third.

Whistleblower protections

A less discussed but potentially important governance mechanism is legal protection for AI safety whistleblowers — employees of AI labs who identify safety violations or concerns and report them to regulators or the public. The history of safety in other industries (aviation, pharmaceuticals, nuclear) includes cases where internal safety concerns were suppressed until they became public through the actions of whistleblowers. Legal frameworks that protect and incentivize this disclosure are part of any robust safety governance system.

Current US federal whistleblower protections (chiefly under the Sarbanes-Oxley Act and the False Claims Act) apply mainly to financial fraud and government contractor misconduct, not to AI safety concerns. Developing legal frameworks that specifically address AI safety whistleblowers, including protections against retaliation and safe channels for disclosure to regulators, is a part of the governance agenda that remains largely unimplemented.

The cultural agenda

Beyond technical research and formal governance, the trajectory of AI safety depends on the culture of AI development — whether safety is treated as a core value by the people and organizations building AI systems, or as a compliance requirement to be minimized. Culture is harder to measure and govern than regulations, but it may be more durable and more important.

Safety as a core organizational value

The most successful safety cultures in high-stakes industries — commercial aviation, nuclear power operations, surgical safety in medicine — share a common feature: safety is not treated as opposed to the organization's core mission but as constitutive of it. Airlines do not succeed by sometimes getting planes to their destinations safely; safety is the foundation on which everything else is built. The cultural shift required in AI development is analogous: safety must be understood not as a constraint on capability development but as a prerequisite for the kind of long-term trust and deployment at scale that makes AI valuable at all.

Several frontier AI labs have made explicit commitments to this cultural orientation. Anthropic's public documentation describes the company as occupying a "peculiar position" — believing it may be building one of the most potentially dangerous technologies in history, while pressing ahead because being at the frontier of AI development is the best position from which to work on safety. Whether this position is coherent, and whether it is genuinely reflected in organizational priorities, is something outsiders can assess only imperfectly through the lab's public actions: what research it publishes, what it declines to deploy, and how it responds when safety and commercial pressures conflict.

Open publication of safety research

A specific cultural norm with significant consequences is the degree to which AI safety research is published openly. There is an inherent tension in safety research between openness (making safety tools and knowledge available to the entire field) and security (avoiding the publication of information that could be misused). The tension is real: a paper that describes how to elicit dangerous capabilities from AI systems could help safety researchers patch vulnerabilities and help bad actors exploit them.

The current norm in the field tilts toward openness for most safety research, with specific carve-outs for information with clear dual-use concern (detailed biosecurity evaluation methodologies, specific jailbreaking techniques that have not yet been patched). The Anthropic alignment science team, DeepMind's safety team, and academic groups at UC Berkeley (the Center for Human-Compatible AI), MIT, and elsewhere publish the large majority of their research. This openness has contributed to a field-wide discourse and collaborative culture that has accelerated safety research — researchers build on each other's work, identify flaws in proposed solutions, and develop shared standards. The alternative — safety research conducted entirely behind closed doors — would probably produce slower progress and less trust from the broader scientific community.

The funding and talent landscape

AI safety research is funded from several distinct sources with different implications for the field's direction and independence.

Open Philanthropy and EA-aligned funding

Open Philanthropy, the philanthropic organization primarily funded by Dustin Moskovitz (a Facebook co-founder) and Cari Tuna, has been the largest non-profit funder of AI safety research since approximately 2015. Open Philanthropy's grantmaking in AI safety reflects an effective altruist framework that prioritizes risks to humanity's long-term future, which leads to an emphasis on existential-risk-focused work. Major grantees have included MIRI (the Machine Intelligence Research Institute), the Center for Human-Compatible AI at Berkeley, Redwood Research, ARC Evals (now METR), the Centre for the Governance of AI at Oxford, and significant funding for safety teams at Anthropic and OpenAI.

Open Philanthropy's dominance in AI safety funding has both advantages and disadvantages. The advantage is that it has funded long-horizon, high-uncertainty research that commercially oriented funders would not support. The disadvantage is that a field disproportionately funded by a single organization with a particular philosophical orientation may develop blind spots. The effective altruist community's emphasis on existential risk has at times been in tension with other researchers' focus on near-term harms, and funding structures that favor one framing over the other shape which research programs receive support.

Other significant funders include the Survival and Flourishing Fund (another EA-aligned organization), the Long-Term Future Fund, and a growing number of government programs including DARPA's AI safety programs and the UK's AI Safety Institute research budget. The funding landscape is becoming more diverse, which is broadly healthy for the field.

Frontier lab safety teams

Frontier AI laboratories have substantially expanded their internal safety teams over the past several years, driven by a combination of genuine organizational commitment, reputational pressure, and — increasingly — regulatory requirement. Anthropic's safety team spans multiple research areas including alignment science, interpretability, model behavior, and policy. Google DeepMind's safety team is one of the largest in the field. OpenAI's safety work continues across multiple teams despite significant organizational turbulence including the departure of the Superalignment team leads in 2024.

The key question about internal safety teams is whether they have meaningful organizational authority when safety concerns conflict with commercial or capability priorities. The departures from OpenAI in 2024 — including Ilya Sutskever, Jan Leike, and others from the safety organization — raised public concerns about whether safety considerations receive adequate organizational weight at the frontier labs. Structural features that support safety team authority include: direct reporting lines to the CEO or board, formal veto power over model releases, and transparency requirements that make it possible to identify when safety concerns have been overridden.

What individual practitioners can do

This course has been technical and field-level. What does it imply for someone who is at the beginning of a career in AI development, policy, or research?

Develop technical depth in safety-relevant areas

Interpretability, scalable oversight, formal verification, robustness, and evaluation methodology are the technical areas where safety expertise is most needed and currently undersupplied. A practitioner with deep expertise in one of these areas and a genuine commitment to applying it to safety problems is more valuable than one with broad but shallow safety knowledge.

Participate in red-teaming and responsible disclosure

Red-teaming programs at frontier labs are one of the most direct ways for technically skilled individuals to contribute to AI safety. When you identify a safety vulnerability in a deployed AI system, responsible disclosure — reporting it to the developer rather than publicizing it without notice — gives the developer the opportunity to address it before it is exploited. Most major AI labs have bug bounty or responsible disclosure programs.

Choose employers and projects with genuine safety commitments

Skilled AI researchers and engineers have significant leverage in the current labor market. Working for organizations with genuine safety commitments — organizations where safety teams have real authority, where safety research is published openly, and where deployment decisions reflect safety evaluations — is a direct way to strengthen the safety culture of the organizations doing frontier AI development.

Engage with governance as a technical expert

Policymakers developing AI governance frameworks need technical expertise to design regulations that are both effective and implementable. Providing technical input to regulatory processes — through comment periods, advisory roles, or direct engagement with policy organizations like GovAI, CAIS, or government AI offices — allows technical knowledge to shape governance in ways that generalist policy staff cannot achieve.

Maintain intellectual honesty about uncertainty

AI safety involves genuine uncertainty about probabilities, timelines, and which interventions will be most effective. Practitioners who express more confidence than the evidence supports — in either direction, whether claiming safety is solved or claiming catastrophe is inevitable — do more harm than good. Epistemic honesty about what we know, what we don't, and what evidence would change our views is a professional standard the field needs.

The relationship between safety and capability research

A persistent misconception about AI safety is that it is fundamentally in tension with capability research — that making AI more capable necessarily makes it less safe, and that making it safer requires accepting capability penalties. This framing has some truth in narrow contexts (safety training sometimes reduces performance on specific benchmarks) but is misleading as a general characterization of the relationship between the fields.

Many of the most important safety advances have come from understanding AI systems better — which is also what capability research requires. Mechanistic interpretability research that helps us understand how models work is also research that could inform more efficient training. Evaluation research that helps us assess what models can and cannot do serves both safety purposes (identifying dangerous capabilities) and capability purposes (identifying capability gaps). Robustness research that makes models reliably perform well under distribution shift serves both safety purposes (preventing unexpected failures) and deployment purposes (making products reliable).

The deeper insight is that capabilities and alignment are not the cleanly separable dimensions they appear to be from outside the field. A system that is more capable of understanding human intent is, in an important sense, more aligned. A system that is more robust to adversarial inputs is both safer and more capable. The framing of safety versus capability as a fundamental tradeoff reflects an early-stage view of the field that more mature research has increasingly complicated.

The Competitive Dynamics Question

The most challenging version of the safety-capability tradeoff is not technical but strategic: if safety requirements slow down one lab's development, competitors who do not face those requirements may reach critical capability thresholds first, potentially with less safe systems. This "race to the bottom" dynamic is a genuine concern that pure technical safety research cannot address. The governance agenda — specifically, frameworks that impose roughly equivalent safety requirements across competing organizations and jurisdictions — is the response to this concern. A safety practice that applies only to organizations that choose it voluntarily provides much weaker guarantees than one that applies across the field through legally binding requirements.

Cautious optimism and why the problem is tractable

It would be easy — and wrong — to conclude from the preceding nine modules that AI safety is an unsolvable problem, that catastrophe is inevitable, or that the field is too far behind the capability curve to make a difference. None of these conclusions follow from a clear-eyed reading of the evidence.

The problem is tractable for specific, concrete reasons. First, interpretability research is making genuine progress. The sparse autoencoder work at Anthropic, the circuit-level understanding achieved for specific capabilities, and the development of better tools for analyzing model internals represent real advances, not just optimistic claims. The pace of progress is slower than the pace of capability development, but the gap is not obviously unbridgeable given the current level of investment.

Second, governance frameworks are being built in the window before catastrophic capabilities emerge. The EU AI Act, the UK AISI, the 2023 US executive order on AI, and the international dialogue frameworks established at the 2023 and 2024 AI Safety Summits are imperfect but real. Governance that is established before a crisis is more likely to be well-designed and durable than governance that is scrambled together in response to a catastrophic event.

Third, the talent and funding flowing into safety research have increased dramatically since 2020. The number of serious researchers working on alignment, interpretability, governance, and safety evaluation is larger than at any previous point in the field's history. The quality and depth of the research have improved correspondingly.

Fourth, and perhaps most importantly, the leading AI labs — whatever their internal tensions and competitive pressures — have made concrete safety commitments that constrain their behavior in ways that would have been unimaginable five years ago. Responsible scaling policies, pre-deployment evaluations, mandatory red-teaming, and engagement with external oversight bodies are now standard practice at Anthropic, Google DeepMind, and (with more variability) OpenAI. These commitments are imperfect and could be abandoned under sufficient competitive pressure, but they represent a real institutional shift in how frontier AI development is conducted.

The Case for Optimism Is Conditional

Cautious optimism is not unconditional optimism. The case for tractability depends on continued investment in technical safety research at a pace that keeps up with capability development, governance frameworks that are binding rather than merely voluntary, and international cooperation that prevents races to the bottom across competing jurisdictions. Each of these conditions is plausible but not guaranteed. The outcome is not predetermined — it depends critically on what the people working in and around AI development choose to do with the knowledge and leverage they have. That is not a comfortable place to land, but it is the accurate one. The problem is solvable. Whether it is solved depends on choices being made now.