The AI Project Cost-of-Quality Framework

Cost-of-quality (CoQ) is the most useful decades-old framework in software economics, and it has not been adapted carefully for AI projects until now. The original Joseph Juran decomposition splits quality spend into four buckets: prevention, appraisal, internal failure, and external failure. The classic finding is that under-invested prevention causes external failure cost to dominate, and that the optimal mix for mature programs sits at roughly 60 percent prevention plus appraisal, 25 percent internal failure, and 15 percent external failure.

AI projects map onto the four-bucket model cleanly, but the magnitudes are different. Prevention in AI is eval suite construction and threshold locking. Appraisal is running evals continuously and on every model upgrade. Internal failure is regression triage caught before production. External failure is a production incident with brand and contractual cost. The optimal AI mix runs roughly 50 percent prevention, 20 percent appraisal, 20 percent internal failure, and 10 percent external failure, yet most AI projects in 2026 spend less than 15 percent on prevention and pay 35 percent or more in external failure cost. This piece adapts the CoQ framework for AI work and gives quantified ranges for each bucket.

This is a spoke under the AI project economics manifesto. The manifesto names evaluation cost as the new unit of account; cost-of-quality is the framework that decomposes that unit of account into the four spend categories most CFOs already understand from manufacturing and traditional software.

The four buckets translated to AI

Juran’s 1951 framework names four spend categories that together comprise total cost of quality. The translation to AI projects:

| Bucket | Traditional software | AI project |
| --- | --- | --- |
| Prevention | Code review, static analysis, design docs | Eval suite construction, threshold locking, prompt registry, retrieval design |
| Appraisal | QA testing, integration testing | Running evals on every change, model-upgrade re-evals, drift monitoring |
| Internal failure | Bug-fixing pre-release | Regression triage, eval threshold misses caught before deploy |
| External failure | Production bugs, customer support | Production incidents, hallucination escapes, contractual SLA breaches, brand cost |

The total CoQ on a representative $500K year-one AI project runs $200K to $250K, roughly 40 to 50 percent of total project spend, which is consistent with the manifesto’s claim that eval engineering is the dominant cost line. The composition of that spend is what determines whether the project ships clean or runs into the cost-of-quality death spiral.

Prevention cost: eval suite construction

Prevention is the largest AI quality bucket because the eval suite is the single artifact that determines whether everything downstream works. A typical mid-tier AI project’s prevention spend decomposes as:

  • Test set construction (8-12% of project cost): 200 to 2000 inputs with ground-truth labels, edge cases, hostile prompts, and representative production traffic. The harder the test set, the more prevention value per dollar.
  • Eval harness build (5-8%): Promptfoo, Inspect, or custom; wired into CI, with diff views, threshold gates, and historical trend tracking.
  • Threshold locking (2-4%): Defining what “passing” means for each eval: exact match, semantic similarity over X, or judge-LLM rubric score over Y. Threshold locking is high-leverage: thresholds that are too loose let regressions through, and thresholds that are too tight block valid releases.
  • Prompt registry build (2-4%): Versioned prompt library with diff history, audit trail, and rollback. Prevents the prompt rot that surfaces as quiet quality decay.
  • Retrieval design and tuning (3-6%): For RAG systems, retrieval design is upstream of generation quality. A bad retriever produces a quality ceiling no amount of generation tuning can lift.
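One way to picture a locked threshold is a small gate that compares a run's scores to fixed floors and reports any misses. The sketch below is illustrative only: the `Threshold` class, the `gate` function, the metric names, and the numeric floors are all hypothetical and are not taken from any of the eval tools named above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    """A locked passing floor for one eval metric (hypothetical structure)."""
    metric: str
    minimum: float


def gate(scores: Dict[str, float], thresholds: List[Threshold]) -> List[str]:
    """Return the metrics that miss their locked floor; an empty list means pass."""
    failures = []
    for threshold in thresholds:
        score = scores.get(threshold.metric)
        # A missing score counts as a failure so a skipped eval cannot pass silently.
        if score is None or score < threshold.minimum:
            failures.append(threshold.metric)
    return failures


# Illustrative floors only; real values come out of the threshold-locking work.
LOCKED = [
    Threshold("exact_match", 0.90),
    Threshold("semantic_similarity", 0.85),
    Threshold("judge_rubric_score", 0.80),
]

run_scores = {"exact_match": 0.93, "semantic_similarity": 0.81}
print(gate(run_scores, LOCKED))
```

In this example run, the similarity score misses its floor and the judge score is absent, so both are reported. Treating a missing metric as a failure is a design choice of the sketch, not a rule stated in this guide.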

Total prevention: roughly 20 to 30 percent of project cost on a well-budgeted project. The eval-cost framework shows how prevention spend converts into per-action quality stability.
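The component ranges listed above can be added up mechanically as a sanity check on a draft budget. The sketch below assumes the $500K representative project used earlier in this guide; the dictionary keys and function name are hypothetical. The naive sum of the upper bounds comes to 34 percent, above the stated 30 percent total. One plausible reading is that a project rarely sits at the top of every range at once, but that is an assumption, not something the figures state.

```python
# Percent-of-project-cost ranges for each prevention component, as listed above.
PREVENTION_COMPONENTS = {
    "test_set_construction": (8, 12),
    "eval_harness_build": (5, 8),
    "threshold_locking": (2, 4),
    "prompt_registry_build": (2, 4),
    "retrieval_design_and_tuning": (3, 6),
}


def sum_ranges(components):
    """Add the lower bounds and the upper bounds separately."""
    low = sum(low_pct for low_pct, _ in components.values())
    high = sum(high_pct for _, high_pct in components.values())
    return low, high


low_pct, high_pct = sum_ranges(PREVENTION_COMPONENTS)
project_cost = 500_000  # representative year-one project from this guide

print(f"prevention share: {low_pct}% to {high_pct}% of project cost")
print(f"prevention dollars: ${project_cost * low_pct // 100:,} to ${project_cost * high_pct // 100:,}")
```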

Under-invested prevention is the single most common AI project failure mode. A team that spends 8 percent of project cost on prevention typically pays 30 to 40 percent on external failure when the system meets production reality.

Appraisal cost: running evals continuously

Appraisal is the recurring cost of running prevention infrastructure. Three components:

  • Eval execution on every change (3-5% of project cost): Each PR triggers an eval run; eval cost is $50 to $200 per run depending on suite size and model cost. Across a year of development, this lands at 3 to 5 percent of project cost.
  • Model-upgrade re-evals (5-8%): Three to five frontier model upgrades per year, each triggering a full re-eval cycle. Re-eval cost includes both compute and engineering judgment time on triaging the regressions surfaced. The compounding nature of model-upgrade work is covered in the AI project compounding return.
  • Drift monitoring (2-3%): Production traffic is sampled and continuously evaluated to detect drift between the dev distribution and the live distribution. Drift monitoring is what catches “the eval set says we’re at 92 percent, but production sees 78 percent”, the specific failure mode that caused 30 percent of the 2025 production incidents we audited.
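The drift check described in the last bullet can be sketched as a comparison of two pass rates. This is a minimal illustration, assuming each sampled production response has already been graded pass or fail; the function names and the 5-point tolerance are hypothetical choices, and the sample data simply reproduces the 92 versus 78 percent example.

```python
from typing import Sequence


def pass_rate(results: Sequence[bool]) -> float:
    """Fraction of graded samples that passed."""
    if not results:
        raise ValueError("need at least one graded sample")
    return sum(results) / len(results)


def has_drifted(dev_rate: float, prod_rate: float, tolerance: float = 0.05) -> bool:
    """True when production trails the dev eval set by more than the tolerance."""
    return (dev_rate - prod_rate) > tolerance


# Synthetic data reproducing the 92 percent vs 78 percent example above.
dev_results = [True] * 92 + [False] * 8
prod_sample = [True] * 78 + [False] * 22

dev_rate = pass_rate(dev_results)
prod_rate = pass_rate(prod_sample)
print(f"dev={dev_rate:.2f} prod={prod_rate:.2f} drifted={has_drifted(dev_rate, prod_rate)}")
```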

Total appraisal: 10 to 15 percent of project cost on a project that takes evals seriously. Projects with no continuous appraisal pay the cost in the later buckets (internal or external failure) at a 5 to 10x multiple.
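As a rough cross-check, the eval-execution share and the per-run cost together imply a number of eval runs per year. The sketch below combines figures from different parts of this guide (the $500K representative project, the 3 to 5 percent share, and $50 to $200 per run), so the resulting range is a back-of-envelope estimate under those assumptions rather than a figure the guide states. The function name is hypothetical.

```python
def implied_eval_runs(project_cost, share_pct_range, cost_per_run_range):
    """Bound the yearly run count implied by a budget share and a per-run cost."""
    low_pct, high_pct = share_pct_range
    cheapest_run, priciest_run = cost_per_run_range
    low_budget = project_cost * low_pct // 100
    high_budget = project_cost * high_pct // 100
    # Fewest runs: smallest budget spent on the most expensive runs, and vice versa.
    fewest = low_budget // priciest_run
    most = high_budget // cheapest_run
    return fewest, most


fewest, most = implied_eval_runs(500_000, (3, 5), (50, 200))
print(f"implied eval runs per year: {fewest} to {most}")
```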

Internal failure cost: regression triage

Internal failure is what regression triage costs when caught before production. The work decomposes as:

  • Diagnosing why an eval test failed (4-6% of project cost): Senior judgment work: is it a prompt regression, retrieval regression, model behavior change, or test set drift? Each diagnosis requires reading the failing eval, comparing it to the passing baseline, hypothesizing the cause, and verifying.
  • Implementing the fix (3-5%): Prompt edit, retrieval re-tune, retrieval re-index, fallback logic, or eval test correction. Most fixes are 2 to 6 hours of work; some are 2 to 3 days.
  • Re-running and re-locking (1-2%): After the fix, re-run the full suite and verify nothing else moved. Re-lock the threshold if the change implies a new baseline.

Total internal failure: 10 to 15 percent of project cost on a healthy project. Higher on projects with large eval suites that catch many regressions; lower on small-suite projects (which look cheaper here but pay it back in external failure).

A useful diagnostic is the ratio of internal-to-external failure cost. A 5:1 internal-to-external ratio is the manufacturing-quality benchmark, meaning most quality issues are caught before customers see them. AI projects in 2026 typically run 0.3:1 to 0.8:1 because their internal failure infrastructure (eval coverage, threshold strictness) is too thin to catch regressions before production.
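The ratio is simple to compute once failure spend is tagged by bucket. The sketch below uses the internal and external failure shares from the mix table later in this guide as inputs; the function name is hypothetical. Computed from those shares, the typical 2026 mix gives 0.3:1 and the healthy AI mix gives 2:1 by cost.

```python
def internal_to_external_ratio(internal_cost: float, external_cost: float) -> float:
    """Internal failure cost divided by external failure cost."""
    if external_cost <= 0:
        raise ValueError("external failure cost must be positive to form a ratio")
    return internal_cost / external_cost


# Inputs are shares of total CoQ (percent) taken from this guide's mix table.
typical = internal_to_external_ratio(15, 50)
healthy = internal_to_external_ratio(20, 10)
print(f"typical 2026 mix: {typical:.1f}:1")
print(f"healthy AI mix:   {healthy:.1f}:1")
```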

External failure cost: production incidents

External failure is the most expensive bucket and the one most projects under-account for. Five components:

  • Direct incident cost (2-4% of project cost): Engineering hours during the incident, hotfix deployment, post-mortem write-up. A typical AI incident in 2026 burns 20 to 60 engineering hours at an internal cost of $8K to $25K.
  • Customer-facing brand cost (3-8%): Each public hallucination, embarrassing answer, or compliance violation has a brand cost that scales with audience reach. The Air Canada chatbot incident, Bing’s early hallucination tour, and the AI agency case studies all show that external failure cost is non-linear in severity.
  • Contractual SLA cost (1-5%): B2B AI products with contracted accuracy SLAs pay credits or refunds when missed. The cost grows with customer count and contract size.
  • Sales velocity cost (2-6%): Public failures slow inbound demo velocity, lengthen sales cycles, and add procurement friction. The cost is real but rarely attributed to the originating quality miss.
  • Hidden re-work cost (3-6%): A production incident triggers an emergency eval-suite expansion, retrieval re-tune, or guardrail addition. The incident is the forcing function for work that should have been done in prevention.

Total external failure: 5 to 10 percent on a healthy AI project, 25 to 35 percent on a project that under-invested in prevention and appraisal. Projects in the second category typically discover the cost only after the first incident.
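The direct incident figures above imply a loaded hourly cost, which can be useful when estimating an incident that is still in progress. The sketch below pairs the low cost with the low hour count and the high cost with the high hour count; that pairing is an assumption, and the function names are hypothetical.

```python
def implied_hourly_cost(incident_cost_usd: float, engineering_hours: float) -> float:
    """Loaded cost per engineering hour implied by an incident's total cost."""
    if engineering_hours <= 0:
        raise ValueError("engineering hours must be positive")
    return incident_cost_usd / engineering_hours


def direct_incident_cost(engineering_hours: float, hourly_cost_usd: float) -> float:
    """Direct cost of an incident from hours burned and a loaded hourly cost."""
    return engineering_hours * hourly_cost_usd


low_rate = implied_hourly_cost(8_000, 20)
high_rate = implied_hourly_cost(25_000, 60)
print(f"implied loaded rate: ${low_rate:.0f} to ${high_rate:.0f} per hour")

# Estimating a 35-hour incident at the lower implied rate.
print(f"35-hour incident: ${direct_incident_cost(35, low_rate):,.0f}")
```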

The optimal mix and the typical mix

The benchmark allocation of total CoQ for a healthy AI project, set against the typical 2026 project:

| Bucket | Healthy project | Typical 2026 project |
| --- | --- | --- |
| Prevention | 50% | 25% |
| Appraisal | 20% | 10% |
| Internal failure | 20% | 15% |
| External failure | 10% | 50% |

Healthy projects spend 70 percent of CoQ on prevention plus appraisal; typical 2026 projects spend 35 percent. The displaced spend lands in external failure: production incidents, brand cost, SLA breaches. The mathematics is straightforward: every dollar shifted from prevention to external failure produces 5 to 10x the cost because external failure is the most expensive bucket per unit defect.
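To see what the two mixes look like in dollars, the table can be applied to a total CoQ figure. The sketch below assumes a total of $225K, the midpoint of the $200K to $250K range quoted earlier, and applies it to both mixes purely for side-by-side comparison. In practice the argument of this guide is that the typical mix ends up with a larger total, so the typical column understates real spend. The constant and function names are hypothetical.

```python
# Shares of total CoQ in percent, copied from the table above.
HEALTHY_MIX = {"prevention": 50, "appraisal": 20, "internal_failure": 20, "external_failure": 10}
TYPICAL_2026_MIX = {"prevention": 25, "appraisal": 10, "internal_failure": 15, "external_failure": 50}


def allocate(total_coq: float, mix_pct: dict) -> dict:
    """Split a total CoQ figure across the four buckets by percentage."""
    if sum(mix_pct.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    return {bucket: total_coq * pct / 100 for bucket, pct in mix_pct.items()}


def prevention_plus_appraisal(mix_pct: dict) -> int:
    """Combined share of the two controllable buckets."""
    return mix_pct["prevention"] + mix_pct["appraisal"]


total_coq = 225_000  # assumed midpoint of the $200K to $250K range
for name, mix in (("healthy", HEALTHY_MIX), ("typical 2026", TYPICAL_2026_MIX)):
    dollars = allocate(total_coq, mix)
    print(f"{name}: prevention+appraisal = {prevention_plus_appraisal(mix)}%")
    for bucket, amount in dollars.items():
        print(f"  {bucket}: ${amount:,.0f}")
```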

The pattern holds across industries. Manufacturing learned it in the 1980s. Traditional software learned it in the 1990s. AI projects in 2026 are still learning it because the prevention infrastructure (evals, thresholds, registries) is new and under-budgeted at the planning stage.

Diagnostics: where your project sits

Three quick questions that locate any AI project on the CoQ map:

  1. What percent of project cost is the eval suite construction? Under 8 percent: under-prevented. 8-15 percent: at the bottom edge of healthy. 15-25 percent: well-budgeted. Over 30 percent: possibly over-built relative to project complexity.

  2. What is the ratio of internal-to-external failure incidents per quarter? Under 1:1: external failure is dominant; the eval suite is too thin to catch what’s escaping. 3:1 to 5:1: healthy. Over 8:1: the eval suite may be over-strict and blocking valid releases.

  3. What was the cost of the last production incident? Under $10K: well-handled. $10K to $50K: typical. Over $100K: external failure cost is high enough to justify shifting another 5 to 10 percent of project budget into prevention immediately.
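The three questions translate directly into threshold checks. The sketch below encodes the bands exactly as listed and returns "not classified by the guide" for values that fall between the stated bands (for example, 25 to 30 percent prevention, or a $50K to $100K incident). Function names and label strings are hypothetical, and assigning the shared boundary of 15 percent to "well-budgeted" is a choice made here, not one the guide specifies.

```python
UNCLASSIFIED = "not classified by the guide"


def classify_prevention_share(pct_of_project_cost: float) -> str:
    if pct_of_project_cost < 8:
        return "under-prevented"
    if pct_of_project_cost < 15:
        return "bottom edge of healthy"
    if pct_of_project_cost <= 25:
        return "well-budgeted"
    if pct_of_project_cost > 30:
        return "possibly over-built"
    return UNCLASSIFIED  # 25 to 30 percent is not covered by the stated bands


def classify_incident_ratio(internal_incidents: int, external_incidents: int) -> str:
    if external_incidents == 0:
        return "no external incidents this quarter"
    ratio = internal_incidents / external_incidents
    if ratio < 1:
        return "external failure dominant"
    if 3 <= ratio <= 5:
        return "healthy"
    if ratio > 8:
        return "eval suite possibly over-strict"
    return UNCLASSIFIED  # 1:1 to 3:1 and 5:1 to 8:1 are not covered


def classify_last_incident_cost(cost_usd: float) -> str:
    if cost_usd < 10_000:
        return "well-handled"
    if cost_usd <= 50_000:
        return "typical"
    if cost_usd > 100_000:
        return "shift budget into prevention"
    return UNCLASSIFIED  # $50K to $100K is not covered


print(classify_prevention_share(6))
print(classify_incident_ratio(3, 10))
print(classify_last_incident_cost(120_000))
```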

Projects that sit poorly on these three diagnostics need the same intervention: shift budget from external failure (which is paid by the buyer in incident cost) into prevention (which is paid by the team in eval engineering). The shift is structurally cheaper because the prevention multiplier is 5 to 10x.

The anatomy of a runaway AI project shows how these CoQ imbalances cascade into the larger project failures that destroy budgets.

Frequently asked questions

What is cost-of-quality and how does it apply to AI projects?

Cost-of-quality (CoQ) is the total project spend organized around four buckets: prevention (preventing defects), appraisal (detecting them), internal failure (fixing them before customers see them), and external failure (paying for the ones that escape). The framework comes from manufacturing in the 1950s and applies cleanly to AI projects with one substitution: prevention in AI is eval suite construction and threshold locking, not unit testing. Total AI CoQ runs 40 to 50 percent of project cost; the composition of that spend determines whether the project ships clean.

What is the optimal CoQ mix for AI projects?

Roughly 50 percent prevention, 20 percent appraisal, 20 percent internal failure, 10 percent external failure. This is shifted toward prevention versus traditional software (which sits at roughly 40-25-20-15) because AI quality issues are harder to catch with code review and easier to catch with structured evals. Projects that achieve this mix typically run 5 to 10x lower external failure cost than the typical 2026 AI project.

How much should I budget for prevention specifically?

Prevention should run 20 to 30 percent of total project cost on a well-budgeted AI project. This decomposes into eval suite construction (8-12 percent), eval harness build (5-8 percent), threshold locking (2-4 percent), prompt registry build (2-4 percent), and retrieval design (3-6 percent). Projects that budget under 15 percent on prevention typically pay it back at 3 to 5x in external failure cost.

What is the typical AI project’s CoQ failure mode?

Under-invested prevention plus appraisal (typically 35 percent of CoQ versus the healthy 70 percent), which surfaces as dominant external failure cost (typically 50 percent of CoQ versus the healthy 10 percent). The displaced spend is moved from controllable, predictable buckets (prevention, appraisal) into uncontrollable, expensive ones (production incidents, brand cost). The fix is to shift budget into prevention at project kickoff, before the first incident teaches the lesson.

How does AI CoQ differ from traditional software CoQ?

Two structural differences. First, AI prevention is eval suites rather than unit tests because correctness in AI is statistical rather than deterministic. Second, AI appraisal must run continuously through model upgrades and prompt changes, not just at release boundaries. The four-bucket framework is identical; the magnitudes and the artifacts differ. Traditional software CoQ is roughly 25 percent of project cost; AI CoQ is roughly 40 to 50 percent.

Can I measure CoQ on a project that is already underway?

Yes. Decompose the trailing 90 days of project spend into the four buckets. Prevention spend is the eval suite, harness, threshold work, and prompt registry. Appraisal is the eval execution cost (compute plus engineering review time). Internal failure is regression triage hours. External failure is incident hours plus brand and contractual cost. The four-bucket sum is the trailing CoQ. Compare the mix against the healthy benchmark; the gap names the rebalancing work.
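The trailing-90-day decomposition can be sketched as a small aggregation over tagged spend records. Everything in the example is hypothetical: the record format, the function names, and the dollar amounts are invented for illustration, and only the healthy benchmark percentages come from this guide.

```python
from collections import defaultdict

# Healthy benchmark from this guide, as percent of total CoQ.
HEALTHY_MIX_PCT = {"prevention": 50, "appraisal": 20, "internal_failure": 20, "external_failure": 10}


def trailing_mix(spend_records):
    """Aggregate (bucket, amount) records into percent-of-CoQ shares."""
    totals = defaultdict(float)
    for bucket, amount in spend_records:
        if bucket not in HEALTHY_MIX_PCT:
            raise ValueError(f"unknown bucket: {bucket}")
        totals[bucket] += amount
    total_coq = sum(totals.values())
    if total_coq == 0:
        raise ValueError("no quality spend recorded")
    return {bucket: 100 * totals[bucket] / total_coq for bucket in HEALTHY_MIX_PCT}


def gap_to_benchmark(mix_pct):
    """Percentage points to add (positive) or remove (negative) per bucket."""
    return {bucket: HEALTHY_MIX_PCT[bucket] - mix_pct[bucket] for bucket in HEALTHY_MIX_PCT}


# Invented trailing-90-day records for illustration.
records = [
    ("prevention", 18_000), ("prevention", 12_000),
    ("appraisal", 10_000),
    ("internal_failure", 20_000),
    ("external_failure", 25_000), ("external_failure", 15_000),
]

mix = trailing_mix(records)
print(mix)
print(gap_to_benchmark(mix))
```

For these invented records the gap shows prevention 20 points short, appraisal 10 points short, and external failure 30 points over the benchmark, which is the rebalancing work the comparison is meant to name.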

How do I make the case for higher prevention spend to a CFO?

Show the cost ratio. Every dollar spent on prevention saves roughly 5 to 10 dollars in external failure on the typical AI project, a 500 to 1000 percent ROI on prevention investment. CFOs who require ROI justification understand the multiplier instantly. The conversation that fails is “we need more eval engineers because evals are good.” The conversation that works is “we are spending 50 percent of our quality budget on production incidents at 8x the cost; shifting 10 points to prevention will reduce CoQ by 30 percent within six months.”

What’s the relationship between CoQ and the cost-per-action framework?

CoQ tells you how much of the project to spend on quality and how to allocate it across four buckets. The cost-per-action framework tells you what the per-unit economics look like once the system is in production. Both frameworks coexist: CoQ is the project-level quality budget, cost-per-action is the run-time unit cost. A well-budgeted project hits its cost-per-action target because its CoQ mix produced a system that works at predictable quality.

How does CoQ change across years one, two, and three of an AI project?

Prevention drops year over year because the eval suite is built once. Appraisal stays roughly flat because evals run continuously regardless of project age. Internal failure stays flat to slightly rising as the eval suite grows. External failure should drop as the project matures and prevention compounds. The full cost-curve shape across years is covered in the AI project cost curve.

Can a small AI project skip the four-bucket model?

No. The four buckets exist whether or not you label them; small projects that skip prevention pay external failure at the same multiplier as large projects, just on smaller absolute spend. A $50K AI project with 5 percent prevention spend is allocating $2.5K to prevention and is likely to pay $15K to $25K in external failure cost on the first incident. The four-bucket discipline scales down; what changes is the absolute dollar amounts, not the percentages.
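The small-project arithmetic can be checked in a few lines. The sketch below uses only the figures from this answer; the function name is hypothetical. The expected first-incident cost works out to 6 to 10 times the prevention budget.

```python
def prevention_budget(project_cost: float, prevention_pct: float) -> float:
    """Dollars allocated to prevention at a given percent of project cost."""
    return project_cost * prevention_pct / 100


budget = prevention_budget(50_000, 5)
low_multiple = 15_000 / budget
high_multiple = 25_000 / budget

print(f"prevention budget: ${budget:,.0f}")
print(f"first-incident cost vs prevention budget: {low_multiple:.0f}x to {high_multiple:.0f}x")
```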

Key takeaways

  • Cost-of-quality (CoQ) decomposes into four buckets: prevention, appraisal, internal failure, external failure. The framework dates to the 1950s and applies cleanly to AI work.
  • Healthy AI projects spend 70 percent of CoQ on prevention plus appraisal; typical 2026 projects spend 35 percent and pay the rest in external failure at a 5 to 10x multiplier.
  • Prevention in AI is eval suite construction, harness build, threshold locking, prompt registry, and retrieval design, together roughly 20 to 30 percent of project cost on a well-budgeted project.
  • External failure cost is the most expensive bucket because it includes brand cost, SLA cost, sales velocity cost, and emergency rework, typically 5 to 10x the cost of catching the same defect in prevention.
  • The diagnostic is three questions: prevention as a percent of project cost, internal-to-external incident ratio, and last incident cost. Projects sitting poorly on any of the three need budget shifted into prevention at the next planning cycle.

CoQ is the framework that turns “evals are important” into “evals should be 25 percent of project cost and here’s why.” A 2026 AI project running on a 1990s software-CoQ template under-budgets prevention by half and pays the difference in production incidents. The four-bucket model is how to fix it before the first incident teaches the lesson.

Last Updated: Jun 12, 2026

Arthur Wandzel
