Twenty-five to thirty-five percent of total project spend on evaluation is the defensible 2026 budget for a customer-facing AI feature. Below 18 percent the project regresses silently. Above 45 percent it litigates the eval set instead of shipping the system. The percentage is structural, not decorative. Eval cost has a fixed component (set construction, labeling, observability) and a variable component that scales with feature count and traffic. The two together produce a percentage that stays inside a narrow band across projects from $200,000 to $4 million. Underfunding the line is the single most consistent cause of post-launch regression tax in 2026 enterprise AI engagements.
This is a spoke under the AI project economics manifesto, which argues that evaluation cost has replaced feature cost as the unit of account. Naming the eval budget percentage is the operational expression of that argument.
Why the eval budget clusters in a defensible band
Eval cost has two structural components that produce the band.
Fixed costs that do not scale linearly with project scope. Eval-set construction is a fixed cost. Labeling vendor setup is a fixed cost. The observability stack that connects production traffic back to eval baselines is a fixed cost. These items run roughly $40,000 to $120,000 in 2026 regardless of whether the project is $200,000 or $2 million. They produce a percentage that is high on small projects and lower on large ones, but rarely negligible.
Variable costs that scale with feature count and traffic volume. Eval-pass cycles cost more on multi-feature systems. Re-labeling cost grows with traffic volume because more drift requires more refresh. Threshold calibration ceremony scales with the number of independently-evaluated subsystems. These items grow with project size and offset the fixed-cost dilution on larger projects.
The fixed-plus-variable structure produces a percentage that stays inside 25 to 35 percent across project sizes that vary by an order of magnitude. The exact percentage on any given project is a function of how stratified the eval requirements are, how stable the workload is, and how regulated the deployment domain is. The band itself is structurally narrow and does not vary widely with project size alone.
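The arithmetic behind the band can be sketched in a few lines. This is a minimal illustration, not a costing model: the function name `eval_share` and the variable-cost figures are assumptions chosen for the example, while the fixed-cost figures sit at the two ends of the $40,000 to $120,000 range quoted above.

```python
def eval_share(total_spend: float, fixed_cost: float, variable_cost: float) -> float:
    """Return eval spend as a fraction of total project spend."""
    if total_spend <= 0:
        raise ValueError("total_spend must be positive")
    return (fixed_cost + variable_cost) / total_spend


# Hypothetical projects; the variable-cost figures are assumptions for illustration.
small = eval_share(total_spend=200_000, fixed_cost=40_000, variable_cost=20_000)
large = eval_share(total_spend=2_000_000, fixed_cost=120_000, variable_cost=480_000)

print(f"small project: {small:.0%}")
print(f"large project: {large:.0%}")
```

In this example both projects land at 30 percent. On the small project the fixed component carries most of the eval spend, and on the large project the variable component does, which is the offsetting effect the two paragraphs above describe.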
What 25 to 35 percent pays for
The empirical 2026 distribution of eval budget across the work it funds:
| Eval-budget line | Share of eval budget | Notes |
|---|---|---|
| Eval-set construction and stratification | 15 to 25 percent | Higher on first-time projects |
| Labeling cost (vendor + internal validation) | 20 to 35 percent | Largest line in most projects |
| Eval-pass cycle compute | 10 to 15 percent | Teacher-model inference + student eval |
| Threshold calibration and re-locking | 5 to 10 percent | Ceremony cost scales with subsystems |
| Observability connecting production to evals | 15 to 20 percent | Often miscategorized as ops |
| Ongoing teacher-on-sample re-evaluation | 10 to 20 percent | Operating-phase line |
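As a rough planning aid, the share ranges in the table can be turned into dollar ranges for a given eval budget. The sketch below assumes a hypothetical $300,000 eval budget; the dictionary keys and the function name `line_ranges` are illustrative. Note that the low ends of the ranges sum to 75 percent and the high ends to 125 percent, so an actual budget picks a point inside each range such that the lines total 100 percent.

```python
# Share-of-eval-budget ranges (percent), copied from the table above.
EVAL_LINES = {
    "eval_set_construction": (15, 25),
    "labeling": (20, 35),
    "eval_pass_compute": (10, 15),
    "threshold_calibration": (5, 10),
    "observability": (15, 20),
    "ongoing_re_evaluation": (10, 20),
}


def line_ranges(eval_budget: int) -> dict[str, tuple[int, int]]:
    """Return the (low, high) dollar range for each eval-budget line."""
    return {
        name: (eval_budget * low // 100, eval_budget * high // 100)
        for name, (low, high) in EVAL_LINES.items()
    }


# Hypothetical eval budget, in whole dollars.
for name, (low, high) in line_ranges(300_000).items():
    print(f"{name:24s} ${low:>7,} to ${high:>7,}")
```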
The “observability connecting production to evals” line is consistently miscategorized in 2026 budgets. It is sometimes folded into engineering or operations, which leaves the eval line looking artificially small and the engineering line absorbing work it does not own. A defensible budget puts this line under the eval owner because the work is meaningless without the eval baselines it references.
The “labeling cost” line is the largest, typically 25 to 35 percent of the eval budget alone. Buyers used to conventional software-development budgeting tend to underestimate this line by 2x to 4x because labeling cost has no analogue in legacy software work. Empirically, a 12-month enterprise AI engagement runs $40,000 to $180,000 in labeling cost alone. The dynamics of this line are detailed in the hidden cost of AI evals.
The curve: too small, too large, and the plateau
The eval-budget curve has three regions.
Under-invested (0 to 18 percent of total spend): Eval gaps surface as customer-reported failures. Threshold drifts. Observability does not connect production to eval baselines so the on-call rotation cannot distinguish regressions from working-as-intended. The first two production quarters pay 15 to 35 percent of in-period revenue as a regression tax. This is the failure mode most boards have not modeled because it does not show up in the build budget; it shows up in revenue six months later.
Plateau (25 to 35 percent of total spend): The eval system is operationally credible. Regressions are detected before customer reports. Threshold is re-locked each cycle. Production traffic is connected to eval baselines through observability. The system can answer “is this a regression or expected behavior” cleanly. IRR is at or near peak.
Over-invested (45+ percent of total spend): Calendar drag dominates. The eval set is over-stratified relative to the workload. Stakeholders litigate eval-set construction rather than system behavior. The project converts from engineering to ceremony. Opportunity-cost revenue not earned exceeds the marginal regression-cost reduction from further eval rigor.
The plateau is wide enough that exact placement inside it is not the planning target. The planning target is staying inside the plateau. Most projects err on the under-invested side because the line is structurally invisible until production exposes the gap, and most legacy budgeting frameworks do not have a category for it.
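The three regions can be expressed as a simple lookup. This is a sketch under stated assumptions: the function name `budget_region` is hypothetical, the treatment of the exact boundary values (18, 25, 35, 45) is a choice made for the example, and the label for percentages that fall between the named regions is an assumption, since the article only names the three regions.

```python
def budget_region(eval_pct: float) -> str:
    """Classify an eval budget, given as a percent of total spend."""
    if not 0 <= eval_pct <= 100:
        raise ValueError("eval_pct must be between 0 and 100")
    if eval_pct <= 18:
        return "under-invested"
    if 25 <= eval_pct <= 35:
        return "plateau"
    if eval_pct >= 45:
        return "over-invested"
    # 18 to 25 and 35 to 45 are not named regions in the article;
    # this label is an assumption for illustration.
    return "between regions"


for pct in (12, 20, 30, 50):
    print(pct, budget_region(pct))
```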
Build-phase versus operating-phase split
The eval budget should split roughly 60 percent to build phase and 40 percent to first-year operations.
Build-phase share (60 percent): eval-set construction including stratification and edge-case curation, initial labeling, threshold calibration and locking, eval-pass cycles that gate the launch decision, observability stack standup, and the documentation that lets the operating-phase team continue the work.
Operating-phase share (40 percent): ongoing teacher-on-sample re-evaluation, quarterly eval-set refresh against drifted production traffic, remediation labeling on detected regressions, observability calibration as workload patterns evolve, and the cadence of monthly eval reviews that keep the system honest.
Projects that allocate the entire eval budget to the build phase run frozen evals against drifting production traffic by month nine. The eval starts giving green signals on a workload distribution that no longer matches reality. By month twelve the eval is decorative; it passes consistently while the production system regresses on segments the frozen eval no longer covers. This is the canonical second-year failure mode and it is entirely a budgeting failure rather than an engineering failure.
The 60-40 split is defensible across most workloads. Stable workloads with low drift can shift to 70-30; rapidly evolving workloads with high drift should shift to 50-50. Static eval against drifting workload is the worst possible posture, and the operating-phase budget is what prevents it.
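The split is simple enough to compute directly. The sketch below assumes three drift labels ("low", "typical", "high") mapped to the 70-30, 60-40, and 50-50 splits described above; the labels and the function name `phase_split` are illustrative, not an established taxonomy.

```python
# (build-phase percent, operating-phase percent) by workload drift.
SPLITS = {
    "low": (70, 30),      # stable workloads
    "typical": (60, 40),  # default split
    "high": (50, 50),     # rapidly evolving workloads
}


def phase_split(eval_budget: int, drift: str = "typical") -> tuple[int, int]:
    """Return (build_dollars, operating_dollars) for a whole-dollar eval budget."""
    build_pct, _ = SPLITS[drift]
    build = eval_budget * build_pct // 100
    # Give the remainder to operations so the two parts always sum to the budget.
    return build, eval_budget - build


build, operate = phase_split(300_000)
print(f"build: ${build:,}  operate: ${operate:,}")
```

For a hypothetical $300,000 eval budget at the default split, this reserves $180,000 for the build phase and $120,000 for first-year operations.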
The staffing model that makes the percentage operational
A dedicated evaluation lead at 0.4 to 0.8 FTE on the project is the defensible staffing posture. The lead owns the eval set, threshold calibration, and the cadence of cycle reviews.
The role is real engineering work. The eval lead understands the workload distribution, the model failure modes, the labeling vendor relationship, the observability stack, and the threshold calibration math. This is not a project-management role. It is a senior engineering role with deep evaluation specialization.
Without a defined owner the eval budget gets spent on whatever the most recent stakeholder asked for. The eval set drifts toward the questions stakeholders find interesting rather than the questions production traffic presents. Three to six months in, the project has spent the eval budget but does not have a coherent eval system; it has labeled examples that came from one-off requests and an observability stack that does not reference any defined baseline.
The eval lead’s specific outputs:
- A documented eval-set v1.0 with stratification rationale and edge-case curation.
- A locked threshold per evaluable subsystem, signed by engineering and finance.
- A monthly eval-cycle calendar that engineering, labeling, and product may reference.
- A quarterly eval-set refresh ceremony that retires drifted strata and adds emerging ones.
- A monthly eval health report that maps production failure rate to eval coverage.
These outputs are concrete. They convert the eval budget from “we’ll evaluate the model” into a defined work product with named owners and named deliverables. The dynamics that produce this discipline, and the failure modes when it is absent, are documented across the AI project FinOps playbook.
Stratification by deployment posture
The eval budget percentage shifts with deployment posture.
| Deployment posture | Eval budget % of total spend | Driver |
|---|---|---|
| Internal tool, low blast radius | 15 to 22 percent | Smaller failure-cost coefficient |
| Customer-facing, standard | 25 to 35 percent | Default; brand and trust components present |
| Regulated or high-stakes | 35 to 45 percent | Higher failure-cost coefficient |
| Multi-tenant SaaS with isolation | 30 to 40 percent | Per-tenant eval requirements |
Internal tools can run lighter eval budgets because the operator catches failures the customer would catch on a customer-facing system. Regulated workloads (medical, legal, financial substantiation) pay a higher percentage because regulator-noticed failures carry tail risk that does not appear in standard regression cost models. Multi-tenant SaaS with isolation requires per-tenant eval coverage, which scales the eval budget linearly with tenant count up to the architectural ceiling.
The temptation to apply a single project’s eval percentage to a different deployment posture produces predictable failures. Internal-tool budgets applied to customer-facing systems under-invest the eval and accumulate regression tax. Regulated-system budgets applied to internal tools over-invest the eval and absorb opportunity cost.
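A small check makes the posture mismatch concrete. The sketch below copies the bands from the table above; the posture keys and the function names `eval_budget_range` and `is_in_band` are hypothetical, as are the dollar figures in the usage lines.

```python
# Eval budget as a percent of total spend, by deployment posture (from the table above).
POSTURE_BANDS = {
    "internal_tool": (15, 22),
    "customer_facing": (25, 35),
    "regulated": (35, 45),
    "multi_tenant_saas": (30, 40),
}


def eval_budget_range(total_spend: int, posture: str) -> tuple[int, int]:
    """Return the (low, high) eval budget in whole dollars for a posture."""
    low, high = POSTURE_BANDS[posture]
    return total_spend * low // 100, total_spend * high // 100


def is_in_band(total_spend: int, eval_budget: int, posture: str) -> bool:
    """True if eval_budget falls inside the posture's band for this total spend."""
    low, high = eval_budget_range(total_spend, posture)
    return low <= eval_budget <= high


# Hypothetical: a $180,000 eval budget on a $1,000,000 project.
print(is_in_band(1_000_000, 180_000, "internal_tool"))
print(is_in_band(1_000_000, 180_000, "customer_facing"))
```

The same $180,000 line sits inside the internal-tool band and below the customer-facing band, which is the under-investment case the paragraph above describes.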
Single line versus distributed line
The eval budget should be consolidated into a single named line on the project budget rather than distributed across individual feature lines.
A consolidated eval line with a defined owner produces a coherent eval system. Distributed eval lines produce eval debt. The reason is incentive: when each feature owner funds their feature’s eval cost, no single owner wants to fund the shared infrastructure (labeling vendor relationships, observability calibration, threshold re-locking ceremony) that the eval system as a whole requires. The shared work is real and expensive, and a consolidated line funds it as a category rather than orphaning it across feature owners.
The interaction with the AI project chargeback model is worth naming. Even when chargeback to feature owners is operationally important, the eval line should be charged back as a shared infrastructure cost, like compute or observability, rather than as a per-feature line. This preserves the consolidated-budget incentive while still allocating cost to the consuming features.
Frequently asked questions
Does the eval budget percentage change for projects with multiple AI features versus a single feature? Slightly. Multi-feature projects gain a small efficiency from shared infrastructure (one labeling vendor relationship, one observability stack, one eval lead) that brings the percentage 2 to 4 points lower than a single-feature project at the same total spend would carry.
Should the eval budget be visible to the buyer or held inside the agency’s project-management cost? Visible. The buyer should see the eval line, the eval lead’s name, the eval cycle calendar, and the threshold review cadence. Hiding the line inside project management produces buyers who do not understand why the next project costs 25 percent more and who attempt to negotiate it down to a smaller number.
How does the eval budget relate to the IRR sweet spot? The eval budget operationalizes the rigor side of the IRR tradeoff. Twenty-five to thirty-five percent eval-spend produces the rigor required to land on the IRR plateau; 18 percent or less produces the under-evaluation that pulls IRR down through regression cost.
Should the eval budget be reduced if the project ships against an existing eval set from a prior engagement? Modestly. An existing eval set reduces eval-set construction cost by 30 to 50 percent but does not reduce labeling, threshold calibration, observability, or operating-phase costs. A defensible eval budget on a project with an inherited eval set runs 18 to 28 percent rather than 25 to 35.
What happens to the eval budget when the project pivots mid-engagement? A pivot resets the eval set. The eval budget should re-baseline against the new workload, with the original eval-set construction cost treated as sunk. Engagements that try to reuse the prior eval set against a different workload run a misaligned eval and accumulate regression tax against the new workload.
Does the eval budget percentage interact with model selection? Modestly. Frontier-model projects carry slightly higher eval costs because the workload is broader; distilled-model projects carry slightly lower eval costs because the workload is narrower. The percentage shifts by 2 to 5 points, not by orders of magnitude.
What governance change makes the eval budget operational? Two changes. First, the eval lead is named at contract signing and present at each milestone review. Second, the eval cycle calendar is a deliverable on the project schedule, not a tactical engineering artifact. These convert the eval budget from a line item into an institutional practice.
Should the eval budget cover red-team and adversarial testing? Partially. Standard adversarial testing (prompt injection, jailbreak attempts, baseline robustness) sits inside the eval budget. Dedicated red-team work covering specific threat models often warrants its own line. The split is clean: eval covers measurement against the locked workload distribution; red-team covers measurement against an adversarial distribution.
How does the eval budget interact with the AI project insurance line? They are complementary. The eval budget reduces the probability and severity of incidents that would draw on the insurance line. A well-funded eval reduces the insurance reserve required, and an under-funded eval requires a larger insurance reserve. The two lines together convert AI risk from open-ended into structured.
What is the right way to communicate the eval budget percentage to a CFO who has not seen one before? Frame it as the AI-project equivalent of QA-plus-test-infrastructure-plus-monitoring rolled into one named line. SaaS projects fund each of those separately and typically at 8 to 15 percent each; AI projects fund them as a consolidated line at 25 to 35 percent. The total cost is higher because the failure modes are unbounded.
Key takeaways
- Twenty-five to thirty-five percent of total spend is the defensible eval budget for customer-facing AI projects in 2026. The percentage is structural, not decorative, because eval cost has both fixed and variable components.
- Below 18 percent, the project regresses silently and pays a 15 to 35 percent regression tax in the first two production quarters. Above 45 percent, calendar drag dominates and the project converts to ceremony.
- The eval budget should split 60 percent to build phase and 40 percent to first-year operations. Allocating the entire budget to build phase produces frozen evals against drifting traffic by month nine.
- The largest line inside the eval budget is labeling cost, typically 25 to 35 percent of the eval budget alone. Buyers consistently underestimate this by 2x to 4x.
- A dedicated evaluation lead at 0.4 to 0.8 FTE is the defensible staffing posture. Without a named owner the budget gets spent without producing a coherent eval system.
- Internal tools run at 15 to 22 percent; regulated systems run at 35 to 45 percent. The percentage tracks the failure-cost coefficient of the deployment posture.
- The eval budget should be a single consolidated line with a named owner, not distributed across feature lines. Distributed eval budgets produce eval debt because no feature owner funds shared infrastructure.
Arthur Wandzel