Module 8 · Expert Track · 24 min read · AI Safety and Alignment

AI Governance

Technical alignment research addresses the question of whether AI systems can be made to behave as intended. AI governance addresses the question of whether they will be, a question that depends not just on the capabilities of researchers but on the incentives, institutional structures, legal frameworks, and international agreements that shape who builds AI, how, and under what constraints. The two problems are separable in principle but inseparable in practice: even a technically solved alignment problem remains unsolved in effect if governance structures fail to ensure that safe practices are adopted.

The governance landscape in 2024

AI governance has moved from academic discussion to active legislation faster than governance of almost any comparable technology. In 2023 and 2024 alone, the European Union finalized the world's first comprehensive AI regulation, the United States issued a sweeping executive order, the United Kingdom established a dedicated AI Safety Institute, China published binding regulations on generative AI, and the G7 adopted voluntary governance principles. The pace of activity reflects both genuine concern and competitive pressure: every major jurisdiction fears falling behind in capability, and none wants to be seen as lagging on safety.

The governance approaches taken by different jurisdictions reflect genuinely different political philosophies and risk assessments. Understanding those differences — and their practical implications for how AI development unfolds — is essential for anyone working at the frontier of AI or in AI policy.

The EU AI Act

The EU AI Act, which entered into force in August 2024, is the world's first comprehensive binding legal framework for AI systems. Its structure reflects a risk-based approach: systems are classified into four risk tiers — unacceptable risk (banned outright), high risk (subject to mandatory conformity assessments), limited risk (transparency requirements only), and minimal risk (effectively unregulated).

The risk classification system

Unacceptable-risk systems include social scoring systems used by governments, real-time biometric surveillance in public spaces (with narrow exceptions), and subliminal manipulation that bypasses rational agency. These are prohibited regardless of technical design or claimed safeguards. The prohibition on social scoring is a direct response to China's social credit system; the biometric surveillance prohibition responds to proposals for city-wide facial recognition deployments in European cities.

High-risk systems — those used in critical infrastructure, education, employment decisions, credit scoring, law enforcement, migration, and administration of justice — must comply with mandatory requirements before they can be deployed. These include: risk management systems, data governance procedures, technical documentation, logging requirements, human oversight mechanisms, accuracy and robustness requirements, and cybersecurity protections. Third-party conformity assessments are required for the highest-stakes applications.
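The tier structure can be read as a lookup from intended use to obligations. The following Python sketch is a simplification for illustration only: the use-case labels, the `classify` function, and the "chatbot" example of a limited-risk system are assumptions introduced here, not terms or tests taken from the Act, and real classification turns on detailed legal criteria.

```python
from enum import Enum


class RiskTier(Enum):
    UNACCEPTABLE = "unacceptable"  # banned outright
    HIGH = "high"                  # mandatory requirements before deployment
    LIMITED = "limited"            # transparency requirements only
    MINIMAL = "minimal"            # effectively unregulated


# Hypothetical shorthand labels for the uses named in the text above.
PROHIBITED_USES = {"government_social_scoring", "subliminal_manipulation"}
HIGH_RISK_AREAS = {
    "critical_infrastructure", "education", "employment", "credit_scoring",
    "law_enforcement", "migration", "justice",
}
# Assumed example of a transparency-only use; not listed in the text above.
TRANSPARENCY_USES = {"chatbot"}

# Obligations for high-risk systems, as summarized in the text above.
HIGH_RISK_OBLIGATIONS = [
    "risk management system", "data governance", "technical documentation",
    "logging", "human oversight", "accuracy and robustness", "cybersecurity",
]


def classify(use_case: str) -> RiskTier:
    """Map a use-case label to a risk tier, checking the strictest tier first."""
    if use_case in PROHIBITED_USES:
        return RiskTier.UNACCEPTABLE
    if use_case in HIGH_RISK_AREAS:
        return RiskTier.HIGH
    if use_case in TRANSPARENCY_USES:
        return RiskTier.LIMITED
    return RiskTier.MINIMAL


if __name__ == "__main__":
    for use in ("employment", "chatbot", "spam_filter"):
        print(use, "->", classify(use).value)
```

Checking the strictest tier first reflects the point made above: a prohibited use stays prohibited regardless of technical design or claimed safeguards.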

General-purpose AI and frontier model provisions

The Act's treatment of general-purpose AI (GPAI) models, which include large language models like GPT-4 and Claude, was the most contentious part of the negotiation. The final text creates a two-tier GPAI regime. Standard GPAI models must provide technical documentation, comply with copyright law, and publish training data summaries. Models trained with more than 10^25 floating-point operations (FLOP) of cumulative compute, a rough proxy for "frontier" models, face additional requirements: mandatory adversarial testing, incident reporting to the EU AI Office, cybersecurity protections, and energy efficiency reporting.

The compute threshold is a proxy for capability based on the current state of the field, not a capability assessment in itself. This mirrors the logic of responsible scaling policies (discussed below) while translating it into a framework that regulators can administer without conducting their own capability evaluations. It is an imperfect proxy, since a highly capable model trained efficiently might fall below the threshold, but it is arguably the most administratively tractable approach currently available.
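To make the proxy concrete, training compute for a dense transformer is often approximated as 6 × parameters × training tokens. The sketch below applies that rule of thumb to the 10^25 threshold. The heuristic is an approximation from the scaling-law literature, not a method prescribed by the Act, and the two model sizes are hypothetical examples rather than figures for any real system.

```python
# Threshold for the additional GPAI requirements, in total training operations.
EU_THRESHOLD_FLOP = 1e25


def estimate_training_flop(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb estimate: roughly 6 operations per parameter per token."""
    return 6.0 * n_params * n_tokens


def exceeds_threshold(n_params: float, n_tokens: float,
                      threshold: float = EU_THRESHOLD_FLOP) -> bool:
    """True if the estimated training compute is above the threshold."""
    return estimate_training_flop(n_params, n_tokens) > threshold


if __name__ == "__main__":
    # Hypothetical models: (parameters, training tokens).
    examples = {
        "70B params, 15T tokens": (70e9, 15e12),    # about 6.3e24 operations
        "400B params, 15T tokens": (400e9, 15e12),  # about 3.6e25 operations
    }
    for name, (n, d) in examples.items():
        flop = estimate_training_flop(n, d)
        print(f"{name}: {flop:.1e} operations, above threshold: "
              f"{exceeds_threshold(n, d)}")
```

The first hypothetical model lands just under the line and the second well over it, even though nothing in the calculation measures what either model can actually do. That is the sense in which the threshold is a proxy rather than an assessment.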

The EU AI Office

The EU AI Act created the AI Office, housed within the European Commission, as the central body responsible for overseeing GPAI models across all EU member states. The AI Office has investigative powers and can conduct technical evaluations, and violations by GPAI model providers can draw fines of up to 3% of global annual turnover or €15 million, whichever is higher. This is the first EU-level body with direct authority over AI companies rather than relying on member-state enforcement, a significant institutional innovation with implications for how AI governance is structured globally.
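The "whichever is higher" rule means the fixed amount acts as a floor on the maximum fine for smaller providers. A minimal sketch of the arithmetic follows; the function name is hypothetical, and the figure it returns is a ceiling on the fine, not the fine that would actually be imposed.

```python
FIXED_CAP_EUR = 15_000_000  # fixed amount
TURNOVER_PERCENT = 3        # percentage of global annual turnover


def max_gpai_fine_eur(global_annual_turnover_eur: int) -> int:
    """Upper bound on a fine: the higher of 3% of turnover and EUR 15 million."""
    percentage_cap = global_annual_turnover_eur * TURNOVER_PERCENT // 100
    return max(percentage_cap, FIXED_CAP_EUR)


if __name__ == "__main__":
    for turnover in (100_000_000, 500_000_000, 2_000_000_000):
        print(f"turnover EUR {turnover:,}: "
              f"maximum fine EUR {max_gpai_fine_eur(turnover):,}")
```

The two caps coincide at a turnover of €500 million; below that the fixed amount governs, above it the percentage does.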

US executive and legislative activity

The United States has taken a different approach, relying primarily on executive action and voluntary commitments rather than binding legislation. This reflects both the constitutional structure of US government (where Congress moves slowly on complex technical issues) and a political economy that is more skeptical of regulation affecting a major domestic industry.

Executive Order 14110

Executive Order 14110 on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence, signed by President Biden in October 2023, was the most comprehensive federal AI governance action to date. Its key provisions included: mandatory reporting requirements for AI models trained above a compute threshold (initially 10^26 operations for dual-use foundation models), directing NIST to develop AI safety and security standards, requiring federal agencies to appoint Chief AI Officers, directing DHS to assess AI risks to critical infrastructure, and instructing the State Department to develop international AI governance frameworks.

The compute reporting requirement, under which developers notify the federal government of training runs above the threshold, was designed to give the government visibility into the frontier AI landscape without imposing substantive restrictions. This is a lighter touch than the EU approach: notification, not permission. The threshold itself was set at 10^26 operations, ten times higher than the EU's 10^25 threshold, reflecting the administration's reluctance to impose requirements on a wider range of models.
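The gap between the two thresholds creates a band of models that fall under the EU's additional requirements but below the US reporting line. The sketch below illustrates this with the two figures from the text; the regime labels and the `triggered_regimes` function are hypothetical names, and both regimes involve further conditions that are ignored here.

```python
# Compute thresholds from the text above, in total training operations.
THRESHOLDS = {
    "EU AI Act: additional GPAI requirements": 1e25,
    "US EO 14110: reporting requirement": 1e26,
}


def triggered_regimes(training_ops: float) -> list[str]:
    """Return the regimes whose compute threshold the training run exceeds."""
    return [name for name, limit in THRESHOLDS.items() if training_ops > limit]


if __name__ == "__main__":
    for ops in (5e24, 3e25, 2e26):
        print(f"{ops:.0e} operations -> {triggered_regimes(ops) or 'none'}")
```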

Voluntary commitments and the White House framework

Alongside formal executive action, the Biden administration secured voluntary safety commitments from seven major AI companies (Anthropic, Google, Microsoft, Meta, Amazon, Inflection, and OpenAI) in July 2023; eight more companies signed on in September 2023, bringing the total to fifteen. The commitments included: pre-deployment safety testing, information sharing on safety incidents, investment in cybersecurity, and public reporting on capabilities and limitations.

The voluntary commitment approach has been criticized as unenforceable and insufficient relative to the risk — companies that make voluntary commitments can withdraw from them without legal consequence. Defenders argue that voluntary commitments are faster to implement than legislation, can be updated more easily as the technology evolves, and that enforcement through reputational consequences may be meaningful for companies whose products depend on public trust.

The UK AI Safety Institute

The United Kingdom established the AI Safety Institute (AISI) in November 2023, housed within the Department for Science, Innovation and Technology. The AISI is notable for being the world's first government body specifically tasked with evaluating the safety of frontier AI models — a capability-assessment function rather than a regulatory function.

The AISI's approach involves direct access to frontier models before their public deployment for structured safety evaluations. The Institute has signed agreements with Anthropic, OpenAI, Google DeepMind, and other frontier labs giving it pre-deployment evaluation access. Its evaluations focus on dangerous capabilities — particularly in the domains of chemical, biological, radiological, and nuclear (CBRN) threats, cyberoffense, and autonomous AI behavior — rather than on broad product safety.

The AISI Model

The AISI's evaluation-focused model is arguably more technically sophisticated than purely regulatory approaches. Rather than specifying in advance what systems may not be capable of (which requires knowing what dangerous capabilities look like), the AISI tests actual model behavior against structured threat scenarios. This generates empirical data about actual system behavior rather than relying on developer self-certification. The limitation is that the AISI has no enforcement authority — it can evaluate, report, and advise, but cannot compel developers to change their systems or delay deployment. The relationship is cooperative, not coercive.

China's AI governance approach

China has taken a different path, implementing binding regulations faster than any Western democracy but focusing primarily on content governance and the political implications of generative AI rather than on safety in the technical alignment sense.

The Interim Measures for the Management of Generative AI Services, effective August 2023, require providers of generative AI services in China to: register with the Cyberspace Administration of China (CAC), conduct security assessments, label AI-generated content, protect training data privacy, and ensure that AI outputs "uphold core socialist values" and do not subvert state power or spread "false information." The political content requirements are the most distinctive feature of China's approach — the regulation is as much concerned with preventing AI from being used against the state as with preventing harm to users.

China also published a framework for the regulation of algorithms in 2022 and has implemented content controls on recommendation systems that have no Western equivalent. The overall approach reflects a governance philosophy in which the state's relationship to AI is one of control and integration, not just oversight and safety.

Compute governance and chip controls

Compute governance — the idea that controlling access to the hardware required to train frontier AI systems is a tractable lever for AI governance — has become one of the most consequential and contested areas of AI policy.

The logic of compute governance

The argument for compute governance rests on three observations. First, frontier AI training is extremely compute-intensive and requires specialized hardware (primarily Nvidia A100 and H100 GPUs and their successors) that is produced by a very small number of companies in a very small number of locations. Second, these chips flow through identifiable supply chains that governments can monitor and restrict. Third, unlike algorithms (which are easily copied and shared) or data (which is difficult to control), physical hardware has a specific location at all times and is subject to import/export controls.

This analysis suggests that controlling the flow of advanced AI chips — particularly export controls preventing them from reaching actors who might develop dangerous AI — is a more tractable governance lever than trying to regulate algorithms or training procedures directly.

US export controls

The Biden administration's export controls on advanced chips to China, announced in October 2022 and significantly expanded in October 2023, are the most significant application of compute governance to date. The controls restrict the export of Nvidia A100 and H100 chips, and chips meeting equivalent performance thresholds, to China and certain other countries. They also restrict the export of the manufacturing equipment required to produce these chips, affecting ASML (the Dutch company that makes the extreme ultraviolet lithography machines required for leading-edge chip fabrication) as well as the US equipment makers Applied Materials, Lam Research, and KLA Corporation.

The goal was twofold: to prevent China from training frontier AI models for military applications, and to slow China's development of domestic semiconductor manufacturing capacity. The controls have had measurable effects on Nvidia's China revenue and on the availability of advanced chips in China, though the extent to which they have actually delayed AI development — as opposed to diverting it toward alternative chips and domestic production — is disputed.

The Evasion Problem

Compute governance through export controls faces a fundamental evasion problem. Advanced chips can be purchased through third countries that are not subject to controls and then transshipped to restricted destinations. The US government has identified significant transshipment through distributors in the UAE, Malaysia, and other countries. Addressing this requires coordinating export controls across many governments — a difficult diplomatic task. More fundamentally, if domestic chip manufacturing capacity develops in restricted countries, export controls become ineffective over the medium term. The question of whether compute governance buys years, decades, or only months of meaningful constraint is unresolved.

Responsible scaling policies

Responsible scaling policies (RSPs) are voluntary frameworks adopted by AI labs themselves that commit to specific safety requirements before training or deploying models above capability thresholds. They represent an attempt by leading AI companies to self-impose governance structures that fill the gap left by slow-moving legislation.

Anthropic's Responsible Scaling Policy

Anthropic published the first formal RSP in September 2023. The policy defines a series of "AI Safety Levels" (ASLs) analogous to biosafety levels in laboratory biosafety frameworks. Each ASL corresponds to a capability level assessed through structured evaluations. The key mechanism is a "don't train past a capability threshold without having mitigations in place" commitment: Anthropic commits not to train models that would exceed a given ASL unless it has either demonstrated that the model does not have the relevant dangerous capabilities, or developed and implemented safety measures sufficient to manage the risk at that capability level.

The ASL framework is specific about what capabilities trigger which levels. ASL-3 — the level requiring the most stringent current mitigations — is triggered by models that could provide "serious uplift" to actors seeking to create biological, chemical, nuclear, or radiological weapons, or by models capable of "meaningful" autonomous operation that could enable AI to attack critical infrastructure. The policy requires specific evaluations to determine whether a model is at ASL-3, and specifies what security and deployment controls are required for ASL-3 models.
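The core commitment can be read as a simple gate. The sketch below is a hypothetical rendering in Python, not Anthropic's actual process: the `EvalResult` type, the `may_proceed` function, and the capability labels are names invented here for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EvalResult:
    capability: str   # hypothetical label, e.g. "cbrn_uplift"
    triggered: bool   # True if the evaluation found the dangerous capability


def may_proceed(evals: Sequence[EvalResult], mitigations_in_place: bool) -> bool:
    """Gate for continuing past a capability threshold.

    Proceeding is allowed only if the evaluations show the dangerous
    capabilities are absent, or the required mitigations are in place.
    """
    if not evals:
        # Nothing has been demonstrated, so only mitigations can justify it.
        return mitigations_in_place
    capabilities_absent = not any(result.triggered for result in evals)
    return capabilities_absent or mitigations_in_place


if __name__ == "__main__":
    evals = [
        EvalResult("cbrn_uplift", triggered=False),
        EvalResult("autonomous_operation", triggered=True),
    ]
    print(may_proceed(evals, mitigations_in_place=False))  # False
    print(may_proceed(evals, mitigations_in_place=True))   # True
```

Treating an empty evaluation list as "not demonstrated" is a design choice in this sketch: the commitment as described requires a positive showing that the capabilities are absent, not merely the absence of evidence that they are present.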

OpenAI's Preparedness Framework

OpenAI published its Preparedness Framework in December 2023. The framework uses a four-level risk taxonomy (low, medium, high, critical) applied across four tracked categories: cybersecurity, CBRN (chemical, biological, radiological, and nuclear) threats, persuasion, and model autonomy. The key commitment is that models assessed as "high" or "critical" risk in any category will not be deployed until mitigations bring that score down to "medium" or below, and models assessed as "critical" in any category will not be developed further until the score is brought down to "high" or below.
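The two-part commitment amounts to a pair of threshold checks on the worst score across domains. The following sketch is an illustrative reading, not OpenAI's implementation: it assumes the scores are those that remain after mitigations are applied, and the `Risk` enum, function names, and domain keys are hypothetical.

```python
from enum import IntEnum
from typing import Mapping


class Risk(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


def worst_score(scores: Mapping[str, Risk]) -> Risk:
    """Highest risk level across domains; an empty scorecard counts as critical."""
    return max(scores.values(), default=Risk.CRITICAL)


def can_deploy(scores: Mapping[str, Risk]) -> bool:
    """Deployment requires every domain to be at medium or below."""
    return worst_score(scores) <= Risk.MEDIUM


def can_continue_development(scores: Mapping[str, Risk]) -> bool:
    """Further development requires every domain to be at high or below."""
    return worst_score(scores) <= Risk.HIGH


if __name__ == "__main__":
    scorecard = {"cybersecurity": Risk.MEDIUM, "autonomy": Risk.HIGH}
    print(can_deploy(scorecard))                # False: one domain is high
    print(can_continue_development(scorecard))  # True: nothing is critical
```

Because the gate uses the maximum across domains, a single high score blocks deployment no matter how low the others are.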

The Preparedness Framework also established the Preparedness team at OpenAI, with a mandate to conduct capability evaluations, red-teaming, and safety research specifically targeting dangerous capabilities. The team reports directly to the CEO and has a formal "safety advisory group" that reviews their assessments. This institutional structure — a safety team with direct CEO access — is designed to give safety concerns organizational weight against product and capability teams.

Third-party evaluations and red-teaming mandates

A recurring theme across both voluntary RSPs and formal regulation is the requirement for external evaluation — having parties other than the developer assess a model's safety properties. This reflects a general principle from product safety regulation: self-certification by the manufacturer is insufficient for high-risk products.

The EU AI Act requires third-party conformity assessments for the highest-stakes high-risk AI systems, and mandates that GPAI models above the compute threshold undergo adversarial testing. The UK AISI conducts pre-deployment evaluations. Several of the US voluntary commitments include sharing information about safety evaluations with third parties. This emerging norm of external evaluation mirrors pharmaceutical clinical trials, aviation safety certification, and nuclear facility inspections: domains where catastrophic failure risk is high enough that self-certification is considered inadequate.

The red-teaming ecosystem

Red-teaming, the practice of having dedicated adversarial testers attempt to elicit dangerous behaviors from AI systems, has developed into a distinct professional discipline. Organizations including Scale AI, Redwood Research, METR (formerly ARC Evals), and various academic groups conduct structured red-teaming for AI labs. The Biden executive order directed NIST to develop standards for AI red-teaming, and the EU AI Act references adversarial testing requirements for frontier models.

Red-teaming's effectiveness as a safety measure depends critically on the quality and creativity of the testers, the adversarial scenarios they explore, and whether the model is evaluated in conditions similar to real-world deployment. A model can pass a red-team evaluation conducted under constrained conditions and still exhibit dangerous behaviors when deployed against creative adversarial users at scale. Current red-teaming is necessary but not sufficient as a safety guarantee.

Governance gaps and enforcement challenges

The current governance landscape has significant gaps. The most important are: the absence of international coordination, the difficulty of using compute thresholds as reliable triggers for regulation, the challenge of governing open-source models, and the weakness of enforcement mechanisms for the most consequential potential harms.

The open-source governance problem

Open-source AI models — including Meta's Llama series, Mistral, and various community fine-tunes — present a distinct governance challenge. Once model weights are publicly released, they cannot be recalled or restricted. The EU AI Act and most national frameworks focus on commercial deployment and require registration and compliance from model providers — but if the model is open-source, there is no single provider to regulate, and users who deploy the model for harmful purposes may be beyond the reach of any jurisdiction.

The governance debate over open-source AI reflects a genuine tension. Open-source releases democratize access to AI capabilities, enable research, and prevent monopolization of transformative technology by a small number of companies. They also put dangerous capabilities in the hands of actors who may have no interest in safety practices and no accountability to regulators. Current governance frameworks have not resolved this tension; they have largely deferred it.

The International Coordination Gap

The most fundamental governance gap is the absence of international coordination. AI development is global; governance is national. A company or government that trains a dangerous AI system in a jurisdiction with minimal regulation creates a risk that falls on the entire world. The analogy to nuclear nonproliferation is imperfect but instructive: nuclear governance required international treaties with inspection regimes and export controls. AI governance will ultimately require similar international mechanisms. The 2023 Bletchley Declaration, signed by 28 countries and the European Union at the UK AI Safety Summit, acknowledged the need for international cooperation but created no binding commitments or enforcement mechanisms. It is a starting point, not a solution.

The dual challenge: safety without stifling

Every governance framework faces the challenge of imposing meaningful safety requirements without stifling beneficial AI development. This tension is real and politically significant. Regulatory compliance costs fall disproportionately on smaller organizations and may entrench the position of large incumbents who can absorb compliance costs. Requirements that are technically demanding may be feasible only for well-resourced labs, effectively creating regulatory barriers to entry that concentrate AI development in a small number of companies. Governance frameworks that slow the development of beneficial AI applications — in healthcare, climate, education — impose real costs that must be weighed against safety benefits.

The tension between safety and innovation is not a reason to abandon governance, but it is a reason to design governance carefully. Requirements calibrated to actual risk, proportionate to the capability level and deployment context of a system, impose less unnecessary burden than requirements applied uniformly across all AI systems. The risk-based approach of the EU AI Act is specifically designed to address this: systems with minimal risk face minimal requirements, while frontier systems with genuinely dangerous potential face stringent ones. Whether the calibration is well-tuned is an empirical question that will only be answered as the Act is implemented.

The Governance Opportunity

The current moment — when frontier AI is being deployed at scale but before the most dangerous potential capabilities have emerged — is the optimal window for establishing governance frameworks. Governance that is built before a crisis is more likely to be thoughtful, well-calibrated, and durable than governance built in reaction to a catastrophic failure. The historical pattern in technology governance (nuclear, aviation, pharmaceuticals, financial regulation) is that serious governance comes after a disaster. In AI, the opportunity to establish governance before the disaster occurs is being actively seized, however imperfectly. This is genuinely unprecedented and worth protecting.