Module 6 · Expert Track · 17 min read · AI Strategy for Leaders

Managing AI Projects and Vendors

The graveyard of enterprise AI is littered with projects that looked promising in discovery, started well in pilot, and then failed at scale. McKinsey estimates that fewer than 20% of AI pilots are eventually deployed in production at meaningful scale. Understanding why AI projects fail — and what management practices distinguish the successful from the abandoned — is one of the highest-leverage capabilities an executive can develop. This module focuses on the practical realities of managing AI projects and the vendors who support them.

Why AI Projects Differ from Software Projects

Leaders who have extensive experience managing software development projects frequently underestimate how different AI project management is. The mental models that work for traditional software — clear requirements, deterministic outputs, testable functionality, linear progress from specification to delivery — systematically mislead when applied to AI.

Non-Deterministic Outputs

Traditional software, given the same input, produces the same output every time. This predictability makes specification, testing, and acceptance criteria straightforward. AI systems are fundamentally probabilistic: they produce outputs based on learned statistical patterns, and those outputs vary — sometimes subtly, sometimes significantly — across runs, time periods, and input distributions. You cannot test an AI system against a fixed specification in the way you test software. Evaluation requires statistical thinking: how accurate is the system across a representative sample? What is the distribution of errors? How does performance degrade at the tails of the input distribution?

The practical implication is that "done" means something different in AI projects. A feature in a traditional software project is done when it works as specified. An AI model is never done in the same sense — it achieves some level of performance across some input distribution, and the questions are always: is this performance good enough for the use case? Is it stable? How does it behave on cases it has not seen? These are probabilistic judgments, not binary pass/fail tests, and leaders who insist on traditional software acceptance criteria for AI systems create impossible situations for their teams.
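The statistical framing above can be made concrete with a confidence interval. The following is a minimal sketch, not a prescribed method: it assumes the evaluation sample is representative and independently drawn, and it uses the Wilson score interval as one reasonable choice. The function names and the acceptance rule (the lower bound must clear the required accuracy) are illustrative assumptions.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for an observed accuracy (z=1.96 is roughly 95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


def meets_bar(correct: int, total: int, required_accuracy: float) -> bool:
    """Hypothetical acceptance rule: accept only if the lower bound clears the bar."""
    lower, _ = wilson_interval(correct, total)
    return lower >= required_accuracy
```

For example, 460 correct answers out of 500 is an observed accuracy of 92%, but the interval runs from roughly 89% to 94%, so under this rule the result would not support a claim of "at least 90% accurate". A larger sample narrows the interval; it does not turn the judgment into a binary test.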

Data Dependency

Traditional software delivery is constrained primarily by engineering capacity. AI projects are constrained primarily by data availability, quality, and labeling. A project that appears straightforwardly scoped can stall for months waiting for data that turns out to be unavailable, inaccessible due to privacy constraints, of insufficient quality to train on, or unlabeled and requiring expensive expert annotation. Data discovery, meaning an honest assessment of what data is available and in what condition, is one of the most critical early activities in any AI project, and it systematically receives too little investment.

Evaluation Challenges

Defining what "good" looks like is harder for AI systems than for traditional software. An AI model that achieves 92% accuracy in classifying customer support tickets by category might seem impressive — but if the business requirement is routing tickets to the right team, and routing errors on the 8% generate disproportionate customer dissatisfaction, accuracy is the wrong metric. Designing evaluation frameworks that connect model performance metrics to business outcome metrics requires collaboration between AI practitioners and business stakeholders that many projects skip, leading to the common failure mode of technically successful models that do not solve the business problem.
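The ticket-routing example can be sketched in a few lines. This is an illustration under stated assumptions: the category names, the cost table, and both sets of predictions are invented, and the cost of a misroute would in practice have to be agreed with business stakeholders.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (actual team, predicted team)


def accuracy(pairs: List[Pair]) -> float:
    """Share of tickets routed to the correct team."""
    return sum(1 for actual, predicted in pairs if actual == predicted) / len(pairs)


def mean_misroute_cost(pairs: List[Pair], cost: Dict[Pair, float],
                       default_cost: float = 1.0) -> float:
    """Average business cost per ticket; unlisted error types use default_cost."""
    total = 0.0
    for actual, predicted in pairs:
        if actual != predicted:
            total += cost.get((actual, predicted), default_cost)
    return total / len(pairs)


# Hypothetical cost table: an outage ticket sent to billing is assumed to be
# twenty times as damaging as an ordinary misroute.
COST = {("outage", "billing"): 20.0}

# Two hypothetical models, each with 92 correct routes and 8 errors.
model_a = [("billing", "billing")] * 92 + [("billing", "general")] * 8
model_b = [("billing", "billing")] * 92 + [("outage", "billing")] * 8
```

Both hypothetical models score 92% accuracy, yet under this cost table model B is twenty times as expensive per ticket as model A. The evaluation framework should report the cost-weighted figure, not accuracy alone.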

The Evaluation Gap

In a widely cited analysis of enterprise AI project failures, Gartner found that evaluation framework weaknesses — teams building AI systems without clear success criteria connected to business outcomes — were a contributing factor in the majority of projects that failed to reach production. Building the evaluation framework before building the model, and getting business stakeholders to agree on it, is one of the most effective project management practices available.

Iterative AI Development Practices

The development practices that work for AI projects are inherently iterative — more closely analogous to hypothesis-driven research than to agile software development, although agile practices can be usefully adapted. The CRISP-DM framework (Cross-Industry Standard Process for Data Mining), while showing its age in an era of foundation models, correctly identifies the iterative, non-linear nature of AI development: understanding the business problem, understanding the data, preparing the data, modeling, evaluating, and deploying are phases that cycle back on each other rather than proceeding linearly.

In practice, successful AI project management looks like this: small, time-boxed experiments with clearly defined hypotheses; rapid evaluation against pre-defined success criteria; explicit go/no-go decisions at defined checkpoints; and a culture that treats early failure as informative rather than shameful. The organizations that manage AI projects well treat the discovery of a non-viable approach early in development as a success — they have avoided wasting resources on a path that would not have worked. The organizations that manage AI projects poorly treat any failure as a problem to be hidden from leadership, which results in zombie projects that consume resources without producing value.

The Two-Track Model

One of the most effective structural practices for AI project management is running two parallel tracks: a research track that explores the technical feasibility of the AI approach with maximum speed and flexibility, and an engineering track that builds the production infrastructure and integration framework. Many organizations make the mistake of waiting for the research track to complete before beginning engineering work, leading to a final integration phase where the technically validated model turns out to be incompatible with production requirements. Running the two tracks in parallel, with defined integration milestones, significantly compresses the time from model validation to production deployment.

Managing Stakeholder Expectations

Stakeholder expectation management is one of the most politically sensitive and practically important aspects of AI project management. The challenge is that AI often enters organizations trailing a cloud of hype. Executives who have read breathless coverage of AI capabilities, attended vendor demos showing cherry-picked examples, or compared their AI ambitions to highly publicized successes at technology companies arrive with expectations that are systematically inflated relative to what is achievable in their specific organizational context.

The leader's job is to manage this without crushing the enthusiasm that drives investment and organizational support. The most effective approach is concrete specificity: replacing vague discussions of AI potential with specific descriptions of what the project will accomplish, what it will not accomplish, what success looks like, what the timeline is, and what risks could prevent delivery. Vague enthusiasm breeds unrealistic expectations; concrete specificity enables informed judgment.

Regular structured communication (monthly written updates that cover what was learned, what did not work, what was adjusted, and what is planned) maintains stakeholder trust through the inevitable turbulence of AI development better than infrequent dramatic updates. The worst pattern is radio silence followed by a major delay announcement, which destroys trust and gives stakeholders no opportunity to course-correct before a project is in serious trouble.

AI Project Failure Modes

Understanding the recurring patterns by which AI projects fail is essential for avoiding them. The following taxonomy draws on documented analysis from Gartner, McKinsey, and the academic AI deployment literature.

Data Problems

The most common failure mode. Data is discovered to be less available, lower quality, or harder to access than assumed at project initiation. Addressing this requires honest upfront data discovery — not optimistic assumptions — and a clear process for obtaining access to data that is currently unavailable. Many AI projects that fail on "data problems" are actually failing on governance problems: the data exists but cannot be accessed due to privacy restrictions, organizational silos, or lack of clear data ownership.

Scope Creep

AI projects are particularly vulnerable to scope expansion because stakeholders, seeing early results, immediately envision extensions and enhancements that "should not be hard to add." Scope creep in AI projects is particularly dangerous because it often involves adding capability requirements that change the fundamental nature of the problem — requiring new data, new model architectures, and new evaluation frameworks — not merely additional software features. Strict scope management and explicit change control processes are essential.

Metric Mismatch

The project optimizes for model performance metrics (accuracy, F1 score, AUC) while the business cares about outcome metrics (revenue impact, cost reduction, customer satisfaction) that are only loosely connected to the model metrics. Building the explicit bridge between technical metrics and business metrics — agreed upon before modeling begins — is the single most effective intervention against this failure mode.

The Pilot Trap

A technically successful pilot that was never designed to scale. AI pilots frequently succeed in controlled conditions — carefully curated data, dedicated team attention, accommodating user populations — and then fail to replicate success when deployed at scale to real production conditions. Designing pilots explicitly to test the assumptions that scaling would challenge, rather than to demonstrate success in ideal conditions, is the difference between pilots that inform and pilots that deceive.

Model Drift

AI models degrade over time as the real-world patterns they were trained on change. A model trained on 2022 customer behavior may perform poorly on 2025 customer behavior. Without systematic monitoring for performance degradation and defined processes for retraining or updating models, deployed AI systems gradually decay in performance — often without anyone noticing until the degradation is severe. AI projects must include production monitoring and maintenance plans, not just deployment plans.
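Monitoring for drift can start before labeled outcomes are available by comparing the distribution of an input feature in production with its distribution at training time. The sketch below uses the population stability index (PSI), one commonly used proxy and not something this module prescribes. The bin count, the epsilon floor for empty bins, and the alert threshold mentioned afterwards are assumptions to tune per use case.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index of `actual` against the `expected` (training) sample."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("expected sample has no spread")
    width = (hi - lo) / bins

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values to the edge bins
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

An often-quoted rule of thumb treats a PSI above about 0.25 as a significant shift worth investigating, but that cutoff is a convention, not a guarantee. Input drift is only a signal: once labels arrive, accuracy on recent data remains the direct measure of degradation.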

Vendor Contracts for AI

Procurement and legal teams that negotiate AI vendor contracts using standard software contract templates will consistently produce agreements that fail to address the most important risks in AI vendor relationships. AI contracts require specific provisions that standard software contracts do not.

Data Rights and Privacy

The contract must explicitly address who owns the data used to train or fine-tune the AI system, whether the vendor has the right to use your data to improve their models, what data is transmitted to the vendor's systems during inference, and how data is retained and deleted. Many enterprise AI buyers have been surprised to discover that their vendor contracts permitted the vendor to use their operational data to train shared models — a provision that raises both competitive intelligence and privacy compliance concerns. Explicit, legally reviewed language on data rights is non-negotiable.

Performance Guarantees

Traditional software SLAs define availability and response time. AI systems require additional performance commitments: accuracy guarantees (what performance levels the vendor commits to on defined test sets), fairness commitments (that the system meets defined criteria for performance parity across demographic groups, where relevant), and drift monitoring obligations (that the vendor will notify you if model performance degrades beyond defined thresholds). These commitments are difficult for vendors to make without careful qualification, but their absence leaves buyers with no recourse when AI performance degrades.
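Commitments of this kind are easier to enforce when they are written as thresholds that can be checked mechanically. The following is a minimal sketch: the `Commitments` fields, the breach labels, and the use of an accuracy gap between groups as the parity measure are all illustrative assumptions, and a real contract would define its own metrics and test sets.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Commitments:
    """Hypothetical contractual thresholds, all expressed as accuracy fractions."""
    min_accuracy: float            # floor on overall accuracy
    max_group_gap: float           # largest tolerated accuracy gap between groups
    max_drop_from_baseline: float  # degradation that triggers a drift notification


def find_breaches(commitments: Commitments, overall_accuracy: float,
                  baseline_accuracy: float,
                  accuracy_by_group: Dict[str, float]) -> List[str]:
    """Return the labels of the commitments that the measured figures violate."""
    breaches: List[str] = []
    if overall_accuracy < commitments.min_accuracy:
        breaches.append("accuracy")
    if accuracy_by_group:
        gap = max(accuracy_by_group.values()) - min(accuracy_by_group.values())
        if gap > commitments.max_group_gap:
            breaches.append("fairness")
    if baseline_accuracy - overall_accuracy > commitments.max_drop_from_baseline:
        breaches.append("drift")
    return breaches
```

With thresholds of 0.90, 0.05, and 0.03, a system measured at 0.91 overall against a 0.93 baseline passes on accuracy and drift, but a split of 0.93 for one group and 0.86 for another would still register a fairness breach.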

Explainability and Audit Rights

In regulated industries — financial services, healthcare, insurance — the contract must specify what level of explainability the AI system provides and what audit rights the buyer retains. As AI regulation has intensified, regulators have increasingly required that financial institutions and healthcare organizations be able to explain how AI-driven decisions were reached. A vendor who cannot provide adequate explainability, or who claims trade secret protection over model internals in ways that prevent regulatory compliance, is a vendor you cannot safely use in regulated contexts.

Exit and Portability

Vendor lock-in is a significant risk in AI systems, where the model, training data, fine-tuning, and integration work may all be proprietary to the vendor. The contract should specify data portability requirements — you must be able to extract your data and fine-tuned model weights — and define a reasonable transition period and assistance obligation if you decide to switch vendors. Many organizations have discovered only at contract renewal that they had inadvertently accepted near-total lock-in, leaving them with no negotiating leverage.

Measuring Vendor Performance

Managing AI vendors requires a structured performance measurement framework that goes beyond the service level metrics of traditional IT contracts. A practical framework has three layers.

Technical performance metrics track the AI system's performance against defined benchmarks over time: accuracy on monthly held-out test sets, false positive and false negative rates, latency, and availability. These metrics should be measured by the buyer — not just reported by the vendor — using evaluation datasets that the vendor has not seen.

Business outcome metrics track whether the AI system is actually delivering the business results it was procured for. This requires establishing baselines before deployment and measuring impact in terms of the business outcomes that motivated the purchase — not just model performance. A customer service AI that maintains 90% accuracy but reduces customer satisfaction scores has a business performance problem regardless of its technical metrics.

Relationship and responsiveness metrics track the vendor's performance as a partner: response time to performance issues, quality of documentation and support, transparency about system changes, and proactiveness in communicating relevant developments. AI systems are living systems that evolve, and a vendor relationship that is not actively managed tends to drift toward the vendor's interests rather than yours.
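The buyer-side measurement described in the technical layer can be sketched as follows. It assumes a binary task with labels encoded as 0 and 1, a held-out set the vendor has not seen, and a hypothetical tolerance for disagreement with vendor-reported figures; the function and metric names are illustrative.

```python
from typing import Dict, List, Optional, Sequence


def error_rates(actual: Sequence[int], predicted: Sequence[int]) -> Dict[str, Optional[float]]:
    """Accuracy, false positive rate, and false negative rate for 0/1 labels."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and the same length")
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(actual),
        # A rate is None when the held-out set contains no cases of that class.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else None,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else None,
    }


def disputed_metrics(measured: Dict[str, Optional[float]],
                     reported: Dict[str, float],
                     tolerance: float = 0.02) -> List[str]:
    """Names of vendor-reported metrics that differ from the buyer's measurement."""
    disputed = []
    for name, vendor_value in reported.items():
        ours = measured.get(name)
        if ours is None or abs(ours - vendor_value) > tolerance:
            disputed.append(name)
    return disputed
```

Running the comparison on each monthly held-out set gives the buyer an independent record to bring to vendor reviews, instead of relying only on the vendor's own reporting.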

The Management Principle

The organizations that manage AI projects and vendors most successfully treat these relationships as ongoing partnerships requiring active management — not as procurement events followed by passive consumption. They invest in the capacity to independently evaluate AI system performance, they maintain internal expertise sufficient to hold vendors accountable, and they build the contractual and technical flexibility to make changes when systems underperform. AI strategy without AI governance produces capabilities that erode.