The AI Project Economics Manifesto: From Feature Cost to Evaluation Cost

The economics of an AI project do not work the way the economics of a CRUD project work. The unit of account is no longer the feature; it is the evaluation. Cost stops being a function of how many engineering hours you spent and starts being a function of which eval thresholds you cleared, how many model upgrades you re-evaluated through, and how much observability you paid for. Every finance, procurement, and engineering org budgeting AI work in 2026 against a 2018 software template is mispricing its own roadmap by a factor that compounds quarter over quarter.

That is the entire thesis. The rest of this manifesto is what AI project economics look like: eight principles, written down, numbered, and signed. Each is deliberately specific enough to quote back to a CFO during a budget cycle.

Decision Scope

This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.

Why this manifesto exists

Buyers planning AI work in 2026 are running into a recurring failure mode: the budget approves, the engineering ships, and roughly nine months in the project starts hemorrhaging money on line items the original budget did not name. Eval engineering. Regression triage. Model-upgrade re-tests. Observability traces. Post-launch maintenance retainers. None of these were “the feature.” Many of them are now half the cost.

The reflex is to call it scope creep and renegotiate. That reflex is wrong. The cost is not creeping; it is structural. The legacy budget template did not have categories for it because the legacy template was built for software where the unit of work is a feature and the unit of completion is “it works.” AI software has a different unit of work (the evaluation) and a different unit of completion: “it passes the eval bar at a defensible threshold.” Budgeting AI work as if it were CRUD work is the source of the surprise.

This manifesto names the eight principles a 2026 finance org must internalize to budget AI projects without quarterly surprises. It is the anchor of our Pillar 2 work on AI project economics and pairs with the Pillar 1 AI agency manifesto on operating models. The two are joined at the contract: a 2026 operating model produces 2026 economics, and vice versa.

The eight principles

1. Evaluation is the unit of account

In legacy software economics the unit of account is the feature. You scope features, estimate features, ship features, accept features, bill features. The whole apparatus (Jira, story points, SOWs, change orders) assumes the feature is the atomic deliverable.

In AI software the unit of account is the evaluation. An AI feature without an eval is not a feature; it is an unbounded liability. Two separate engineering organizations can ship “the same feature” (the same prompt, the same retrieval, the same tool calls) and produce systems whose accuracy on the buyer’s actual workload differs by 25 points. The feature description does not contain the information that distinguishes them. The eval does.

What changes for budgeting:

Line items are not “build the agent.” They are “agent passing eval-set v1 at >= 0.82 weighted score on the 240-prompt enterprise test set.”
Acceptance is not “the agent demos in QBR.” It is the eval report attached to the invoice.
Progress is not story points burned down. It is the curve of eval score over time, plotted against the locked threshold.

What an organization needs to encode: every AI line item in the budget references an eval set by name and version. If the line item cannot, the line item is mispriced.
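One way to make this rule checkable is to give every line item an optional reference to a versioned eval set and flag the items that lack one. The sketch below is illustrative only: the class names, the eval-set name, and the dollar amounts are assumptions made for the example, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalRef:
    """A named, versioned eval set plus the score that counts as a pass."""
    eval_set: str      # hypothetical name, e.g. "enterprise-test-set"
    version: str       # e.g. "v1"
    threshold: float   # minimum weighted score for acceptance


@dataclass(frozen=True)
class LineItem:
    description: str
    amount_usd: int                      # illustrative figure only
    eval_ref: Optional[EvalRef] = None   # None means "no eval named"


def unpriced_items(budget: list[LineItem]) -> list[str]:
    """Return the line items that do not name an eval set and version."""
    return [item.description for item in budget if item.eval_ref is None]


budget = [
    LineItem("Agent passing eval-set v1 at >= 0.82", 120_000,
             EvalRef("enterprise-test-set", "v1", 0.82)),
    LineItem("Build the agent", 80_000),  # no eval reference, so it is flagged
]

print(unpriced_items(budget))  # ['Build the agent']
```

A check like this can run against the budget spreadsheet export before a SOW is signed; anything it returns is a line item to rewrite in eval-threshold terms.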

2. Eval-pass thresholds drive billing, not feature lists

The natural follow-on to principle 1: the contractual milestone is hitting an eval threshold, not delivering a feature list. This sounds semantic and is not.

A feature list is a static description of what got built. An eval threshold is a dynamic measurement of how well it works against the workload it was paid to do. The first is auditable by reading the PR. The second is auditable by running the eval suite. The first can ship and still be useless. The second cannot.

OpenAI’s public Evals framework, Anthropic’s Claude evaluations tooling, and open-source projects like Promptfoo and Inspect have all matured to the point where eval thresholds are operationally cheap to specify, run, and audit. There is no longer a tooling excuse for billing AI work against feature lists.

What changes:

30 percent of contract value is held back against eval-threshold milestones, not feature acceptance.
Each milestone names a specific threshold on a specific eval set version. “Pass” is the eval report; “fail” triggers structured remediation, not change orders.
The buyer gets read access to the eval suite from day one. Not “delivered at the end” but read access from kickoff. The eval suite is jointly owned scope, not agency IP.

What an org needs to encode: procurement language that defines deliverables in eval-threshold terms. Legal teams used to drafting feature-acceptance clauses will resist. The cost of resisting is the cost of the next 18 months of AI projects.
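The holdback mechanics above reduce to simple arithmetic. As a minimal sketch, assume each milestone names an eval-set version, a threshold, and the share of the held-back amount it releases on a pass. The milestone names, thresholds, shares, and contract value below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Milestone:
    name: str
    eval_set_version: str
    threshold: float
    holdback_share: float  # fraction of the held-back amount released on pass


def holdback_released(contract_value: float, holdback_fraction: float,
                      milestones: list[Milestone],
                      scores: dict[str, float]) -> float:
    """Sum the holdback released by milestones whose eval report clears the threshold."""
    held_back = contract_value * holdback_fraction
    released = 0.0
    for m in milestones:
        score = scores.get(m.name)  # a missing eval report counts as not passed
        if score is not None and score >= m.threshold:
            released += held_back * m.holdback_share
    return round(released, 2)


milestones = [
    Milestone("M1", "eval-set v1", 0.70, 0.5),
    Milestone("M2", "eval-set v1", 0.82, 0.5),
]
scores = {"M1": 0.74, "M2": 0.79}  # M2 misses its threshold

print(holdback_released(500_000, 0.30, milestones, scores))
```

In this example only M1 clears its bar, so half of the 30 percent holdback is released and M2 goes to structured remediation rather than a change order.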

3. Model-upgrade re-evaluation is budgeted, not absorbed

Frontier models update on roughly a six-to-twelve-week cadence. Anthropic, OpenAI, Google, Meta: every major lab is shipping non-trivial model upgrades inside any project’s lifecycle. Treating this as a footnote creates the single most expensive line item in the unbudgeted AI project.

A model upgrade is not free. It requires re-running the full eval suite, triaging regressions, adjusting prompts and retrieval to the new model’s behavior, and re-locking the threshold. On a serious project this is two to four engineering weeks per upgrade. At roughly four upgrades across a 12-month engagement, that is eight to sixteen weeks of work that is structurally invisible to a 2018 SOW.

What changes:

Budgets allocate explicit “model upgrade re-eval” capacity. Three to five upgrades per year is the planning horizon as of 2026.
The retainer or maintenance contract names re-eval as a covered activity. It is not a line item the agency ambushes the buyer with.
The buyer’s CFO knows in advance that “Claude 4.7 → 4.8” or “GPT-5 → 5.1” is a budget event, not a press-release event.

What an org needs to encode: a percentage of annualized AI ops spend (we use 8–12 percent) earmarked as model-upgrade reserve. If the model cadence slows, the reserve rolls forward. If it accelerates, you are not under-budgeted. We discuss the operational version of this in the AI agency tax piece: the engagement-level tax that compounds when this reserve is missing.
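The planning arithmetic behind this principle fits in a few lines. The sketch below uses the figures from this section (two to four weeks per upgrade, three to five upgrades a year, an 8–12 percent reserve). The function names and the 600,000 ops-spend figure are assumptions for illustration.

```python
def reeval_weeks(upgrades_per_year: int,
                 weeks_per_upgrade: tuple[int, int] = (2, 4)) -> tuple[int, int]:
    """Low and high engineering weeks spent on re-evaluation in a year."""
    low, high = weeks_per_upgrade
    return upgrades_per_year * low, upgrades_per_year * high


def upgrade_reserve(annual_ops_spend_usd: int,
                    reserve_pct: tuple[int, int] = (8, 12)) -> tuple[int, int]:
    """Low and high model-upgrade reserve as a share of annualized ops spend."""
    low_pct, high_pct = reserve_pct
    return (annual_ops_spend_usd * low_pct // 100,
            annual_ops_spend_usd * high_pct // 100)


for upgrades in (3, 4, 5):
    print(upgrades, reeval_weeks(upgrades))
# 3 (6, 12)
# 4 (8, 16)
# 5 (10, 20)

print(upgrade_reserve(600_000))  # (48000, 72000)
```

Four upgrades a year reproduces the eight-to-sixteen-week figure; the three-to-five planning horizon widens it to six to twenty weeks, which is the range the reserve has to absorb.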

4. Observability is COGS, not OpEx

In legacy SaaS economics, observability (logs, traces, dashboards) is overhead, charged to OpEx, sized as a small percentage of engineering. The feature shipped; observability is the rounding error that helps you sleep.

In AI economics, observability is cost-of-goods-sold. It is not optional and it is not small.

The reason is structural: an AI feature with no observability is uncompletable. You cannot tell whether it is regressing. You cannot triage which user query class drives the failures. You cannot defend the eval threshold against the next quarterly board review. The feature exists; the AI feature is the feature plus the observability that proves it is still working. Strip the observability and you do not have a cheaper feature; you have an unmeasurable liability that will eventually produce a customer-visible failure nobody can diagnose.

In practice this means:

Trace storage, replay tooling, and online eval scoring are line items in the build budget, not an afterthought.
The maintenance retainer pays for someone to read the traces, not just store them.
The unit cost per eval token (input + output + reasoning) is forecasted as part of unit economics, exactly the way infra cost is forecasted in a SaaS gross margin model.

What an org needs to encode: observability as a fixed percentage of inference spend (we use 15–25 percent) plus a one-time build line. Not “we’ll add Datadog later.”
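Both numbers in this principle can be forecast before the build starts. The sketch below sizes the observability line as 15–25 percent of inference spend plus a one-time build line, and estimates the cost of one full eval-suite run from input, output, and reasoning tokens. The token counts and per-million-token prices are hypothetical placeholders, not any provider’s actual pricing.

```python
def observability_budget(annual_inference_usd: int,
                         build_line_usd: int,
                         pct_range: tuple[int, int] = (15, 25)) -> tuple[int, int]:
    """First-year observability range: share of inference spend plus the build line."""
    low_pct, high_pct = pct_range
    low = annual_inference_usd * low_pct // 100 + build_line_usd
    high = annual_inference_usd * high_pct // 100 + build_line_usd
    return low, high


def eval_run_cost_usd(prompts: int,
                      tokens_per_prompt: dict[str, int],
                      usd_per_million: dict[str, float]) -> float:
    """Cost of one full eval-suite run across input, output, and reasoning tokens."""
    per_prompt = sum(tokens_per_prompt[kind] * usd_per_million[kind] / 1_000_000
                     for kind in tokens_per_prompt)
    return prompts * per_prompt


# Hypothetical workload and prices, for illustration only.
tokens = {"input": 2000, "output": 500, "reasoning": 1000}
prices = {"input": 3.0, "output": 15.0, "reasoning": 15.0}

print(observability_budget(400_000, 50_000))  # (110000, 150000)
print(round(eval_run_cost_usd(240, tokens, prices), 2))
```

Multiplying the per-run cost by the number of runs per week (every PR, every prompt change, every model swap) gives the eval-token line that belongs next to infra cost in the gross margin model.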

5. Eval engineering is 30 to 40 percent of the project, named upfront

This is the principle finance teams find most surprising and which we have the most direct data on. Across mature AI engineering shops in 2026, eval work runs 30–40 percent of total project cost. That is not a tail; it is roughly equal to the cost of the application engineering itself.

The work decomposes into four sub-budgets:

Test set construction. Curating 200–2000 representative inputs with ground-truth or rubric-graded outputs. Domain-specific. Cannot be outsourced to scale-labelers without senior review. Roughly 8–12 percent of project cost.
Eval harness build. The infrastructure that runs the suite on every PR, every model swap, every prompt change. Roughly 5–8 percent.
Regression triage. When the suite goes red, someone has to determine why and decide. Roughly 10–14 percent.
Model-upgrade re-eval. Principle 3, billed in. Roughly 6–10 percent.

If your AI project budget has these four lines invisible (folded into “engineering”), your project is mispriced by 30 percent and the surprise will arrive between months four and eight. We unpack the decomposition further in the hidden cost of AI evals piece.
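To see what the decomposition means in dollars, the four sub-budget ranges can be applied to a total project cost. This is a sketch under the percentages stated above; the dictionary keys and the 1,000,000 project cost are illustrative assumptions.

```python
# Percentage ranges (low, high) of total project cost, taken from the list above.
EVAL_SUB_BUDGETS = {
    "test_set_construction": (8, 12),
    "eval_harness_build": (5, 8),
    "regression_triage": (10, 14),
    "model_upgrade_reeval": (6, 10),
}


def eval_budget(total_project_cost_usd: int) -> dict[str, tuple[int, int]]:
    """Dollar range for each eval sub-budget."""
    return {
        name: (total_project_cost_usd * low // 100,
               total_project_cost_usd * high // 100)
        for name, (low, high) in EVAL_SUB_BUDGETS.items()
    }


def total_eval_share() -> tuple[int, int]:
    """Sum of the low ends and sum of the high ends, in percent."""
    lows = sum(low for low, _ in EVAL_SUB_BUDGETS.values())
    highs = sum(high for _, high in EVAL_SUB_BUDGETS.values())
    return lows, highs


print(total_eval_share())  # (29, 44)
for name, dollars in eval_budget(1_000_000).items():
    print(name, dollars)
```

The four ranges sum to 29–44 percent, which brackets the 30–40 percent headline; a project is unlikely to sit at the low or the high end of all four lines at once.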

What an org needs to encode: a separate “eval engineering” budget owner, distinct from feature engineering, with its own milestones, hires, and tooling line.

6. Inference is a pass-through line, not agency margin

Token spend at scale is the fastest-growing line item on most production AI systems. It is also the one most often hidden by agencies billing flat fees that include inference.

The honest billing structure is direct: the buyer’s Anthropic or OpenAI or Google account and the buyer’s bill, with the agency’s job being to make that bill predictable and shrinking over time. When the agency holds the API keys and rolls inference into a flat fee, three things happen, all bad. The buyer loses visibility into the actual unit economics of their own product. The agency develops a margin that grows when costs fall, which removes their incentive to optimize. And the buyer cannot price their own product accurately because they do not see the variable cost line.

What changes:

Inference is a pass-through line item with zero markup. Visible. Audited. Owned by the buyer.
The agency’s value capture is on engineering hours, eval engineering, and observability: work that scales with judgment, not with inference volume.
Cost-per-completion is reported on the same dashboard as eval score. The two are co-optimized; trading 10 percent of eval score for 60 percent of cost is an explicit decision, not a hidden one.

What an org needs to encode: a procurement clause forbidding marked-up inference. Pair with a unit-economics dashboard so the CFO and the head of engineering see the same number.
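Putting eval score and cost-per-completion on one dashboard amounts to reporting both relative changes together whenever a configuration changes. A minimal sketch follows; the snapshot labels, scores, and per-completion costs are made-up numbers chosen to reproduce the 10-percent-for-60-percent trade described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    label: str
    eval_score: float
    cost_per_completion_usd: float


def relative_change(before: float, after: float) -> float:
    """Fractional change from before to after (negative means a decrease)."""
    return (after - before) / before


def tradeoff(baseline: Snapshot, candidate: Snapshot) -> dict[str, float]:
    """Report the eval-score change and the cost change side by side."""
    return {
        "eval_score_change": relative_change(baseline.eval_score,
                                             candidate.eval_score),
        "cost_change": relative_change(baseline.cost_per_completion_usd,
                                       candidate.cost_per_completion_usd),
    }


baseline = Snapshot("large model", 0.90, 0.050)            # hypothetical
candidate = Snapshot("small model + caching", 0.81, 0.020)  # hypothetical

for metric, change in tradeoff(baseline, candidate).items():
    print(f"{metric}: {change:+.0%}")
```

The point of the paired report is that neither number can move without the other being visible, so the trade is approved by someone rather than discovered later.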

7. Maintenance is a retainer, sized to regression rate

A 2018 software project ends at launch and goes into “support”: a small percentage of the build budget covering bugs and incremental features. A 2026 AI project does not end at launch. Launch is roughly the midpoint of the cost curve, not the end.

The reason: regression rate. AI systems regress in ways legacy software does not. The model upgrades. The user query distribution shifts. A new product line gets added and the retrieval index goes stale. A vendor deprecates an embedding model. The eval suite drifts because the test set gets stale. Each event is an engineering ticket that has to be triaged, root-caused, and fixed against the eval bar, not absorbed into a quarterly support budget.

What changes:

The maintenance retainer is sized as a percentage of build cost, typically 25–40 percent annualized, depending on regression rate.
The retainer covers a named SLA: eval suite re-runs weekly, regressions triaged within 48 hours, model-upgrade re-evals within two weeks of a major release.
The retainer is billed monthly with usage transparency, not as a flat insurance premium.

What an org needs to encode: maintenance as a multi-year P&L line, not a one-time contingency. The CFO who modeled the project as build-then-decay will be wrong. The CFO who modeled it as build-plus-retainer will be right.
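The retainer sizing and the 48-hour triage SLA are both easy to compute and audit. The sketch below assumes the 25–40 percent annualized range from this section and a hypothetical 480,000 build cost; the function names and timestamps are illustrative.

```python
from datetime import datetime, timedelta


def retainer_range(build_cost_usd: int,
                   pct_range: tuple[int, int] = (25, 40)) -> dict[str, tuple[int, int]]:
    """Annual and monthly retainer range as a percentage of build cost."""
    low_pct, high_pct = pct_range
    annual = (build_cost_usd * low_pct // 100, build_cost_usd * high_pct // 100)
    monthly = (annual[0] // 12, annual[1] // 12)
    return {"annual": annual, "monthly": monthly}


def triaged_within_sla(opened: datetime, triaged: datetime,
                       sla_hours: int = 48) -> bool:
    """True when a regression was triaged within the SLA window."""
    return triaged - opened <= timedelta(hours=sla_hours)


print(retainer_range(480_000))
# {'annual': (120000, 192000), 'monthly': (10000, 16000)}

opened = datetime(2026, 3, 2, 9, 0)
print(triaged_within_sla(opened, datetime(2026, 3, 4, 8, 0)))   # True (47 hours)
print(triaged_within_sla(opened, datetime(2026, 3, 4, 10, 0)))  # False (49 hours)
```

Where a given project lands inside the range depends on its observed regression rate, which this sketch does not model; it only bounds the line so the multi-year P&L has a number to carry.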

8. Payback is staged, not single-gate

The traditional ROI question is “when does this project pay back?” with the answer expected as a single number (six months, twelve months, eighteen months) beyond which the project is killed.

This is the wrong shape. AI projects pay back across at least three time horizons that compound differently:

The 90-day eval-feedback gate. Did we build something whose eval curve is rising and whose unit cost is falling? If yes, the project has earned its second 90 days. If no, kill or restart.
The 12-month capability gate. Did the system reach a deployable threshold against the production workload, with observability and a maintenance retainer? If yes, the project has earned its compounding investments. If no, descope hard.
The 24-month compounding gate. Did the eval library, prompt registry, agent skills, and observability harness become assets that the next AI project bootstraps from? If yes, you have built a platform; your next project’s marginal cost is half of this one’s. If no, you have built a feature, and your next project starts from zero.

A 6-month payback rule sounds disciplined and is destructive. It rules out exactly the projects that compound. We argue this case in detail in the payback paradox spoke.

What changes:

The portfolio is structured around three gates, not one. Each gate has explicit criteria and named killers.
Compounding investments (eval libraries, prompt registries, skills) are budgeted as platform, not as project. They have their own owner and own P&L.
The CFO who killed projects at month 7 because they had not paid back stops being a hero and starts being the bottleneck.

What an org needs to encode: a staged-payback portfolio review, run quarterly, with three gates and three associated kill criteria. We expand the model in our ROI calculator critique.
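The three gates can be written down as explicit checks so that a quarterly review applies the same criteria every time. This sketch is one strict reading of the gate questions above: the 90-day gate compares the first and last points of each curve, and the 24-month gate requires all four assets to be reused. Those choices, and all names and numbers, are assumptions for illustration.

```python
from dataclasses import dataclass

PLATFORM_ASSETS = {"eval library", "prompt registry", "agent skills",
                   "observability harness"}


@dataclass(frozen=True)
class GateResult:
    gate: str
    passed: bool
    action: str


def ninety_day_gate(eval_scores: list[float], unit_costs: list[float]) -> GateResult:
    """Pass when the eval curve ends higher and the unit cost ends lower."""
    passed = eval_scores[-1] > eval_scores[0] and unit_costs[-1] < unit_costs[0]
    return GateResult("90-day", passed,
                      "fund the next 90 days" if passed else "kill or restart")


def twelve_month_gate(score: float, threshold: float,
                      has_observability: bool, has_retainer: bool) -> GateResult:
    """Pass when the deployable threshold is met with observability and a retainer."""
    passed = score >= threshold and has_observability and has_retainer
    return GateResult("12-month", passed,
                      "fund compounding investments" if passed else "descope hard")


def twenty_four_month_gate(assets_reused: set[str]) -> GateResult:
    """Pass when the next project bootstraps from every platform asset."""
    passed = PLATFORM_ASSETS <= assets_reused
    return GateResult("24-month", passed,
                      "platform" if passed else "one-off feature")


print(ninety_day_gate([0.61, 0.68, 0.74], [0.09, 0.07, 0.05]))
print(twelve_month_gate(0.84, 0.82, has_observability=True, has_retainer=False))
print(twenty_four_month_gate({"eval library", "prompt registry"}))
```

A real review would likely use a trend fit rather than endpoint comparison and might accept partial asset reuse; the value of the sketch is that the kill criteria are named before the review, not negotiated during it.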

What changes for finance, procurement, and engineering

Walk the eight principles into the three functions that move when an organization re-budgets:

Finance. The budget template gets new line items: eval engineering, observability as COGS, model-upgrade re-eval, staged-payback gates, multi-year retainers. The 2018 template (engineering + infra + small contingency) becomes the wrong instrument for AI work. Finance teams that update the template before procurement teams update the contracts ship clean. Finance teams that wait for procurement to lead live in change-order hell.

Procurement. SOW language shifts from feature-list deliverables to eval-threshold milestones. Pass-through inference clauses go in by default. Maintenance retainer language is drafted once and reused, with SLAs around eval freshness and regression triage. Buyers running 2018 procurement language against 2026 AI work pay roughly the agency tax decomposed in our coordination cost piece: they pay for misalignment instead of for software.

Engineering. Eval engineering becomes a discipline, not a sprint subtask. A senior engineer owns the eval suite the way a senior owns the build pipeline. Observability is built in week one, not month nine. Model-upgrade rehearsals are calendared, not surprises. Engineering leaders who insist on the discipline get a clean cost curve. Engineering leaders who let it slide produce a budget surprise their CFO will not forgive twice.

The pattern across all three: AI project economics are knowable. Every line item in this manifesto can be costed, governed, and audited. The only reason a project produces budget surprises is that the organization budgeted it as if it were 2018 software. The principles above replace the template.

Frequently asked questions

What is “evaluation cost” and how does it differ from feature cost?

Evaluation cost is the total project cost organized around the unit of evaluation: the test set, the eval harness, the threshold-locking process, the regression triage, and the re-evaluation against new models. Feature cost is the cost organized around shipping features: scoping, building, accepting. The two are not interchangeable. An AI project budgeted by feature cost typically under-budgets eval cost by 30–40 percent of total spend, which shows up as scope creep between months four and eight. Budgeting by evaluation cost surfaces the work upfront.

Why do you say observability is COGS instead of OpEx?

Because in AI software, observability is structurally required to know whether the product is working. A feature you cannot measure is not a feature you have shipped; it is a liability. Treating observability as overhead (a small OpEx percentage charged to engineering) produces under-instrumented systems whose first regression is invisible until a customer reports it. Treating observability as cost-of-goods-sold (15–25 percent of inference spend, plus a one-time build line) produces systems whose health is visible on the same dashboard as their unit economics. The distinction matters because COGS is governed; OpEx is rounded.

How much of an AI project budget goes to eval engineering?

Across mature AI engineering shops in 2026, eval engineering runs 30–40 percent of total project cost, decomposed into test set construction (8–12%), harness build (5–8%), regression triage (10–14%), and model-upgrade re-eval (6–10%). The number surprises finance teams used to budgeting evals as 5 percent QA. It should not. Eval work is where the project’s correctness is produced; under-budgeting it just moves the cost to month eight under the label “scope creep.” Public eval-tooling cost data from Promptfoo, Inspect, and the OpenAI evals framework all corroborate the order of magnitude.

How often do model upgrades require re-evaluation work?

Three to five times per year as of 2026. Anthropic, OpenAI, and Google all ship non-trivial model upgrades on roughly a quarterly cadence, with minor versions in between. A meaningful upgrade (Claude 4.x → 4.7, GPT-5 → 5.1, Gemini 3.x → 4) typically requires re-running the full eval suite, triaging a 5–15 percent regression rate, adjusting prompts and retrieval, and re-locking the threshold. This is two to four engineering weeks per upgrade. A 12-month engagement that ignores this is under-budgeted by 8–16 engineering weeks.

Why is fixed-bid pricing dangerous for AI work?

Fixed-bid pricing assumes scope is decided up front and can be defended against discovered reality. AI scope cannot. Model choice, retrieval design, eval bar, latency budget, and failure-mode taxonomy all shift as the project meets real workload. Fixed-bid pushes the agency to defend the original scope against this discovered reality, which converts scope discovery into scope-creep disputes, the most expensive coordination cost. Eval-threshold pricing, where milestones tie to specific eval thresholds rather than feature lists, is structurally compatible with how AI work progresses.

Should we hold back budget against eval thresholds?

Yes. Roughly 30 percent of contract value held back against eval-threshold milestones is the structure that aligns agency and buyer incentives on AI work. The thresholds and eval-set versions are named in the contract; “pass” is the eval report; “fail” triggers structured remediation. The 30 percent number is conservative; high-stakes engagements run 40 percent. Less than 20 percent and the holdback is ceremonial; the agency does not feel the threshold.

What is the difference between a 90-day, 12-month, and 24-month payback gate?

The 90-day gate asks: is the eval curve rising, is the unit cost falling, is the system on a trajectory worth the next 90 days? The 12-month gate asks: did the system clear a deployable threshold with observability and a retainer in place, or are we shipping an unmeasurable liability? The 24-month gate asks: did we build a platform (eval library, prompt registry, agent skills, observability harness) that the next AI project bootstraps from, or did we build a one-off feature whose successor starts from zero? Each gate has different kill criteria, and the projects that compound are the ones that survive all three.

Does this manifesto apply to internal AI teams as well as external agency engagements?

Yes, almost identically. Internal teams pay the same eval engineering cost, the same observability-as-COGS cost, the same model-upgrade re-eval cost; they just pay it in headcount and calendar time rather than billable invoices, which makes the cost less visible and harder to govern. The eight principles are operating principles, not contracting principles. The hardest part for internal teams is that they cannot renegotiate their own budget template the way an SOW can be renegotiated; they have to win the argument with their own CFO.

Where does this manifesto fit relative to your AI agency manifesto?

The AI agency manifesto describes the operating model: what a 2026 AI development partner owes its buyer, in eleven commitments. This manifesto describes the economics that operating model produces. They are joined at the contract: the operating model defines the work; the economics price it. A 2018 contract running on a 2026 operating model still mis-prices. A 2026 contract running on a 2018 operating model still over-coordinates. Both have to update for AI project budgets to stop producing surprises.

Key takeaways

The unit of account in AI project economics is the evaluation, not the feature. Budgets that do not name eval thresholds are mispricing the work.
Eval engineering is 30–40 percent of project cost, decomposed into test sets, harness, regression triage, and model-upgrade re-eval. If those four lines are invisible, the project is under-budgeted.
Observability is COGS, not OpEx. Sized as 15–25 percent of inference spend plus a build line.
Model upgrades are budget events, not press-release events. Reserve 8–12 percent of annualized ops spend for re-evaluation.
Inference is a pass-through line, billed direct from the model provider to the buyer, with zero agency markup.
Maintenance is a retainer (25–40 percent of build cost annualized) sized to regression rate, with named SLAs on eval freshness.
Payback is staged across 90-day, 12-month, and 24-month gates. A single 6-month payback rule destroys exactly the projects that compound.
The legacy 2018 software budget template is the wrong instrument for 2026 AI work. Updating the template is finance’s job; updating the SOW is procurement’s; updating the discipline is engineering’s. All three move together or none of them work.

The economics of AI projects are knowable. The only reason they produce surprises is that organizations are still budgeting them with last decade’s template.
