Module 5 · 22 min read · AI in Governance

Procurement and Oversight of Government AI Systems

Governments are the largest institutional purchasers of technology on earth. How they buy AI systems — what they require, what they scrutinize, what they leave unexamined — will shape how algorithmic decision-making affects citizens for decades. Procurement is not an administrative detail; it is one of the most consequential points of leverage government has over AI.

Why AI procurement is different

Traditional government software procurement has a logic built around stable specifications. An agency defines what it needs (a database to manage permits, a payroll system, a mapping application) and vendors bid on their ability to deliver those defined functions. The resulting contract specifies deliverables, performance metrics, and acceptance criteria. Once deployed, the software behaves the same way in three years as it does today unless someone deliberately changes it.

AI systems break this model in several important ways. Machine learning models are not static; they may drift as the data distribution they encounter in deployment diverges from the data they were trained on. A fraud detection model trained on historical patterns may perform well at first and degrade as fraudsters adapt. A predictive maintenance model for infrastructure may lose accuracy as the infrastructure itself ages. Traditional software procurement has no mechanism for specifying, testing, or monitoring this kind of temporal behavior.
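One way to make drift something a contract can name and measure is the Population Stability Index (PSI), which compares the distribution of a model's scores at training time with what the system produces in deployment. The sketch below is illustrative only: the sample scores, the bin edges, and the 0.2 alert threshold (a common rule of thumb, not a standard) are assumptions.

```python
import math
from bisect import bisect_right

def bin_shares(values, edges):
    """Share of values in each bin defined by sorted interior edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    total = len(values)
    # A small floor avoids log(0) when a bin is empty.
    return [max(c / total, 1e-6) for c in counts]

def population_stability_index(baseline, current, edges):
    """PSI = sum over bins of (current - baseline) * ln(current / baseline)."""
    base = bin_shares(baseline, edges)
    curr = bin_shares(current, edges)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, curr))

# Hypothetical fraud scores: at training time vs. after fraudsters adapt.
training_scores = [0.1, 0.2, 0.25, 0.3, 0.4, 0.45, 0.5, 0.6, 0.7, 0.8]
deployed_scores = [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]
edges = [0.25, 0.5, 0.75]

ALERT_THRESHOLD = 0.2  # assumed rule of thumb; an agency would set its own
psi = population_stability_index(training_scores, deployed_scores, edges)
needs_review = psi > ALERT_THRESHOLD
print(f"PSI = {psi:.2f}, review needed: {needs_review}")
```

A contract could require the vendor to report a statistic like this on a fixed schedule, which gives the agency a concrete, testable obligation rather than a general promise of continued accuracy.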

AI systems also resist precise specification in ways traditional software does not. You can specify exactly what a payroll system must calculate; you cannot specify with equal precision what a computer vision system must "see" at every edge case. This creates a fundamental challenge for government purchasers who operate within legal frameworks that require clear, accountable, and auditable decisions. A system whose outputs cannot be fully predicted or explained may be technically capable but legally and ethically problematic.

The specification problem

Traditional procurement asks: does the system do what the contract specifies? AI procurement must also ask: does the system perform consistently across demographic groups? Does it remain accurate over time? Can it explain its outputs in terms a decision-maker can act on? Are there failure modes we haven't anticipated? These questions require expertise that most procurement offices do not currently have.

Defining requirements: AI in public sector RFPs

Requests for Proposals (RFPs) for AI systems need to go well beyond functional requirements. Agencies increasingly recognize that they must specify not just what a system does, but how it does it, and what constraints it must operate within. This means building new categories of requirement into procurement documents.

Explainability requirements specify how the system must be able to account for its outputs. This is not a binary — "explainable vs. black box" — but a spectrum. A system might provide feature importance scores, counterfactual explanations ("this application was declined; had income been 15% higher, it would have been approved"), or simplified rule-based approximations of its behavior. The appropriate level of explainability depends on the stakes of the decision and the legal requirements that apply to it.
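The counterfactual style of explanation can be illustrated with a toy decision rule. Everything in this sketch is hypothetical: the `approve` rule, its weights and threshold, and the applicant's figures are invented for illustration and do not describe any real system.

```python
def approve(income, debt):
    """Hypothetical decision rule: approve when the score reaches a threshold."""
    score = 0.5 * income - 0.8 * debt
    return score >= 20_000

def income_counterfactual(income, debt, step=0.01, max_increase=1.0):
    """Smallest fractional income increase, in `step` increments, that flips a decline.

    Returns None if the applicant is already approved, or if no increase
    up to `max_increase` would change the outcome.
    """
    if approve(income, debt):
        return None
    n_steps = round(max_increase / step)
    for i in range(1, n_steps + 1):
        increase = i * step
        if approve(income * (1 + increase), debt):
            return increase
    return None

# Hypothetical applicant.
applicant = {"income": 48_000, "debt": 9_000}
needed = income_counterfactual(**applicant)
if needed is not None:
    print(f"Declined; approval would have required income about {needed:.0%} higher.")
```

A search like this only works when the agency can query the model repeatedly with altered inputs, which is itself something an RFP can require.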

Fairness and bias requirements specify what the agency considers acceptable performance across demographic groups, and what metrics will be used to measure it. This requires the agency to make a prior decision about which fairness criterion applies — equal accuracy, equal false positive rates, equal positive predictive value — because these cannot all be simultaneously satisfied when base rates differ across groups.
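A small worked example shows why the choice of criterion has to be made in advance. The counts below are invented: two groups with different base rates receive a classifier with identical accuracy and identical false positive rates, yet its positive predictive value differs sharply between them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GroupOutcomes:
    """Confusion-matrix counts for one demographic group (hypothetical data)."""
    tp: int
    fp: int
    fn: int
    tn: int

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.fn + self.tn

    @property
    def base_rate(self) -> float:
        return (self.tp + self.fn) / self.total

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / self.total

    @property
    def false_positive_rate(self) -> float:
        return self.fp / (self.fp + self.tn)

    @property
    def positive_predictive_value(self) -> float:
        return self.tp / (self.tp + self.fp)

group_a = GroupOutcomes(tp=40, fp=10, fn=10, tn=40)  # base rate 0.5
group_b = GroupOutcomes(tp=16, fp=16, fn=4, tn=64)   # base rate 0.2

for name, g in (("A", group_a), ("B", group_b)):
    print(
        f"group {name}: base rate {g.base_rate:.2f}, accuracy {g.accuracy:.2f}, "
        f"FPR {g.false_positive_rate:.2f}, PPV {g.positive_predictive_value:.2f}"
    )
```

Here a positive prediction is correct 80% of the time for group A but only 50% of the time for group B, even though accuracy and false positive rates match. An RFP that asks for "fairness" without naming the metric leaves this trade-off to the vendor.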

Data provenance requirements specify what data the vendor is permitted to use in training, what documentation of that data must be provided, and what restrictions apply to data generated by the agency's own operations. Many agencies have discovered after deployment that vendors trained models on government data in ways that created vendor dependency or privacy exposure.

Functional requirements

What the system must do — the task it performs, the inputs it accepts, the outputs it produces, and the performance thresholds it must meet.

Explainability requirements

How the system must account for individual outputs, what documentation must accompany decisions, and whether human-readable explanations must be provided to affected individuals.

Equity requirements

What fairness metrics the system must satisfy, which demographic groups must be analyzed, what disparate impact thresholds are acceptable, and how ongoing monitoring will be conducted.

Security and reliability requirements

Resilience to adversarial inputs, data poisoning, model extraction attacks, and performance degradation; uptime requirements and disaster recovery provisions.

Human oversight requirements

Which decisions must be reviewed by a human before action is taken, what training staff must have to review AI outputs, and how overrides must be documented.

Algorithmic impact assessments before deployment

An algorithmic impact assessment (AIA) is a structured evaluation of a proposed AI system conducted before it is deployed in a consequential context. It borrows from the tradition of environmental and privacy impact assessments — systematic pre-deployment scrutiny of the potential harms a new capability could cause, with findings that inform whether and how to proceed.

A well-designed AIA examines the intended use of the system and all reasonably foreseeable misuses; the population that will be affected and any subpopulations with heightened vulnerability; the data on which the system was or will be trained, including its provenance, gaps, and historical biases; the accuracy and equity of the system across demographic groups; the legal implications of the system's outputs for affected individuals; and the governance structures that will oversee the system once deployed.

Canada's Directive on Automated Decision-Making, which came into force in 2019 and has been updated since, is the most developed national framework for AIAs in government. It establishes four impact levels based on the potential consequences of a decision — from administrative inconvenience to significant individual harm — and requires progressively more stringent assessment, transparency, and human review as the impact level rises. At the highest impact level, decisions must be reviewed by a qualified human before any action is taken, peer review of the system is required, and a plain-language explanation must be provided to any affected individual who requests one.

Canada's Directive: what it gets right

By tying requirements to impact level rather than to technology type, Canada's framework remains relevant as AI capabilities evolve. A novel AI system used to prioritize a low-stakes administrative task carries lighter requirements than a well-established system used to make consequential decisions about individuals. This proportionality is the right principle for government AI oversight.

Pre-procurement market engagement

Many procurement failures result from agencies writing requirements in a vacuum, without adequate understanding of what the market can actually provide. Pre-procurement market engagement, which can take the form of Requests for Information (RFIs), industry days, or market surveys, allows agencies to learn from vendors and civil society before committing to a specification.

For AI procurement, market engagement is especially valuable because the field moves quickly. A requirement written two years ago for a natural language processing system may not reflect what state-of-the-art systems can or cannot do today. Engaging with vendors, academic researchers, civil society organizations, and other governments before writing the RFP helps agencies write better requirements and avoid locking in outdated assumptions.

Market engagement must be conducted carefully to avoid giving any vendor unfair advantage in the subsequent competition. Best practice includes publishing a summary of market engagement findings, making the same questions available to all respondents, and ensuring that staff who conduct market engagement are not later involved in evaluating bids in ways that could introduce bias.

Vendor risk management and the dependency problem

Government AI procurement amplifies a risk that traditional software procurement also faces: vendor dependency. When a government agency's operational capacity depends on a proprietary AI system that only one vendor can maintain, that vendor has extraordinary leverage. Switching costs are high, model documentation may be incomplete, and the agency may not have the internal expertise to evaluate whether the system is still performing adequately.

Vendor risk management for AI requires provisions that go beyond standard performance bonds and service level agreements. Agencies should require comprehensive model documentation (what training data was used, what validation was performed, what known limitations exist) that would allow another vendor or in-house team to replicate or replace the system. Source code escrow provisions may be appropriate for critical systems. Contracts should specify what happens to agency data if the vendor goes out of business or is acquired, including by a foreign entity.

The black box contract problem

Agencies that accept "proprietary model" provisions without requiring adequate documentation create a situation where they cannot evaluate the system's performance, cannot explain its outputs to affected individuals, and cannot replace it without starting from scratch. This is not merely a technical problem — it is a governance failure that creates legal exposure and undermines accountability.

Contract provisions should require vendors to provide documentation sufficient for an independent technical team to understand the system, audit its performance, and assess its ongoing reliability — even if the underlying model weights remain proprietary.

Algorithmic audits: structure and practice

An algorithmic audit is an independent examination of an AI system's behavior, conducted by parties with no financial interest in the system performing well. It differs from the vendor's own validation in that it is adversarial in spirit — the auditor is actively trying to find problems — and independent in structure — the auditor has access to the system but is not employed by or financially dependent on the vendor.

Audits can examine many dimensions of a system's behavior. Performance audits assess accuracy, recall, and precision, potentially disaggregated by demographic group. Fairness audits examine disparate impact in outputs and attempt to determine whether disparity, where it exists, is explainable by legitimate factors or reflects bias in training data or model design. Robustness audits probe the system's behavior under adversarial conditions, edge cases, and data distributions that differ from its training environment. Compliance audits assess whether the system operates within the legal constraints that apply to government decisions in the relevant domain.
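As a rough illustration of what a disaggregated performance and fairness audit computes, the sketch below derives selection rate and recall per group and then the ratio of the lowest to the highest selection rate. The records are invented, and the 0.8 cutoff borrows the "four-fifths" rule of thumb from US employment-selection guidance purely as an assumed example threshold; the appropriate test depends on the legal domain.

```python
from collections import defaultdict

def audit_by_group(records):
    """records: iterable of (group, predicted_positive, actually_positive)."""
    stats = defaultdict(lambda: {"n": 0, "selected": 0, "actual_pos": 0, "tp": 0})
    for group, predicted, actual in records:
        s = stats[group]
        s["n"] += 1
        s["selected"] += int(predicted)
        s["actual_pos"] += int(actual)
        s["tp"] += int(predicted and actual)
    return {
        group: {
            "selection_rate": s["selected"] / s["n"],
            # Recall is undefined for a group with no actual positives.
            "recall": s["tp"] / s["actual_pos"] if s["actual_pos"] else None,
        }
        for group, s in stats.items()
    }

def disparate_impact_ratio(report):
    """Lowest group selection rate divided by the highest (None if nobody is selected)."""
    rates = [r["selection_rate"] for r in report.values()]
    highest = max(rates)
    return min(rates) / highest if highest else None

# Hypothetical audit sample: (group, model said yes, ground truth was yes).
records = (
    [("A", True, True)] * 4 + [("A", True, False)] * 1
    + [("A", False, True)] * 1 + [("A", False, False)] * 4
    + [("B", True, True)] * 2 + [("B", True, False)] * 1
    + [("B", False, True)] * 3 + [("B", False, False)] * 4
)

report = audit_by_group(records)
ratio = disparate_impact_ratio(report)
flagged = ratio is not None and ratio < 0.8  # assumed illustrative threshold
print(report, ratio, flagged)
```

A disparity flagged this way is a starting point for the auditor's investigation, not a finding of bias; the next step is to determine whether legitimate factors explain it.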

The UK's Algorithmic Transparency Recording Standard, developed beginning in 2021, takes a different approach: rather than mandating independent audits, it requires agencies to publish structured information about algorithmic tools they use, the decisions they inform, and the oversight processes in place. This creates accountability through disclosure rather than through formal audit, making information available to civil society, researchers, and affected communities who can then conduct their own scrutiny.

Contract provisions for transparency and ongoing monitoring

Contract negotiation is the point at which government has the most leverage. Once a system is deployed and the agency has built operational dependencies on it, renegotiating terms is difficult. Transparency and oversight provisions should therefore be negotiated at the outset, not added as afterthoughts when problems emerge.

Key contract provisions for AI systems should include: regular performance reporting disaggregated by demographic group; notification requirements when the vendor makes material changes to the model; audit rights — the agency's ability to commission independent technical review of the system; incident reporting requirements when the system produces outputs that cause harm or that the vendor identifies as anomalous; and exit provisions that specify what documentation and data the agency receives if the contract ends.

Ongoing monitoring is not a post-procurement afterthought; it is a core governance function. AI systems can degrade in ways that are not immediately visible. A child welfare risk assessment tool may begin producing different outputs as the demographic composition of the caseload changes, without any change to the model itself. Monitoring protocols should specify what metrics will be tracked, at what frequency, by whom, and what triggers a formal review or suspension of the system.
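A monitoring protocol of this kind can be written down precisely enough to execute. The following is a minimal sketch under stated assumptions: the metric names, the thresholds, and the two-tier escalation (formal review, then suspension) are hypothetical choices an agency would replace with its own.

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    NONE = 0
    FORMAL_REVIEW = 1
    SUSPEND = 2

@dataclass(frozen=True)
class MetricRule:
    """Thresholds for one tracked metric (higher values are better)."""
    name: str
    review_below: float
    suspend_below: float

def evaluate(rules, observed):
    """Return the most severe action triggered by the observed metric values."""
    worst = Action.NONE
    for rule in rules:
        value = observed.get(rule.name)
        if value is None:
            # A metric that was not reported is itself a monitoring failure.
            action = Action.FORMAL_REVIEW
        elif value < rule.suspend_below:
            action = Action.SUSPEND
        elif value < rule.review_below:
            action = Action.FORMAL_REVIEW
        else:
            action = Action.NONE
        if action.value > worst.value:
            worst = action
    return worst

# Hypothetical protocol: overall accuracy plus the worst recall across groups.
rules = [
    MetricRule("accuracy", review_below=0.85, suspend_below=0.75),
    MetricRule("min_group_recall", review_below=0.80, suspend_below=0.65),
]

this_quarter = {"accuracy": 0.88, "min_group_recall": 0.72}
print(evaluate(rules, this_quarter).name)
```

Tracking the worst-performing group rather than only the average matters in cases like the child welfare example, where a caseload shift can degrade performance for one group while the overall figure still looks acceptable.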

Building internal capacity versus vendor dependency

There is a long-running debate in government technology about whether agencies should build internal technical capacity or rely on vendors for specialized expertise. AI sharpens this dilemma. AI systems are complex, evolve rapidly, and require ongoing technical judgment — not just at the point of procurement, but throughout the system's operational life.

An agency that lacks internal AI expertise cannot write good requirements, cannot evaluate vendor responses intelligently, cannot conduct meaningful oversight once a system is deployed, and cannot recognize when a system is performing badly. It is entirely dependent on the vendor's own assessment of the system's performance — a clear conflict of interest. Building internal capacity is therefore not a luxury; it is a prerequisite for responsible AI governance.

This does not mean every agency must employ data scientists who build models from scratch. It does mean having staff who understand enough about AI systems to ask the right questions of vendors, interpret performance reports critically, and know when to bring in independent expertise. Several governments have centralized this capacity to support agencies that lack it in-house; examples include the UK's Government Digital Service, Canada's Digital Standards, and the U.S. General Services Administration's AI Center of Excellence.

Procurement red flags

Red flags include vendors who refuse to provide training data documentation; contracts that prohibit independent audits; performance metrics that cannot be disaggregated by demographic group; the absence of model drift monitoring provisions; the lack of human override mechanisms; and lock-in provisions that prevent the agency from switching vendors or building internal alternatives. Each of these should prompt significant scrutiny before a contract is signed.
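The red flags above can be turned into a simple pre-signature checklist. This is only a sketch: the field names are hypothetical, and a real review would rest on legal reading of the contract rather than on boolean flags.

```python
# Each key is a safeguard the reviewer confirms is present in the draft contract.
# Field names are hypothetical; the descriptions mirror the red flags listed above.
RED_FLAGS = {
    "training_data_documented": "Vendor refuses to provide training data documentation",
    "independent_audit_allowed": "Contract prohibits independent audits",
    "metrics_disaggregated": "Performance metrics cannot be disaggregated by demographic group",
    "drift_monitoring": "No model drift monitoring provisions",
    "human_override": "No human override mechanism",
    "free_to_switch": "Lock-in provisions prevent switching vendors or building alternatives",
}

def find_red_flags(contract_terms):
    """Return a description for every safeguard that is absent or unconfirmed."""
    return [
        description
        for key, description in RED_FLAGS.items()
        if not contract_terms.get(key, False)
    ]

# Hypothetical draft contract under review.
draft = {
    "training_data_documented": True,
    "independent_audit_allowed": False,
    "metrics_disaggregated": True,
    "human_override": True,
    "free_to_switch": True,
    # "drift_monitoring" was never addressed, so it counts as a red flag.
}

for flag in find_red_flags(draft):
    print("RED FLAG:", flag)
```

Treating an unaddressed item as a red flag, rather than assuming it is fine, reflects the point that silence in a contract tends to favor the vendor once the system is deployed.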

Responsible AI procurement is ultimately about aligning the moment of maximum government leverage — the signing of a contract — with the values and accountability structures that should govern AI in public service. It requires technical sophistication, legal creativity, and institutional commitment. But it is also, in the end, a question of whether government takes seriously its obligation to understand and control the systems it uses to make decisions about citizens' lives.