The CFO signing off on a 2026 AI project sees a number that is structurally too small. Not because the agency lied, and not because the engineering team padded, but because the budget template the CFO is reviewing predates the seven cost lines that run a production AI system. The lines are not exotic. They are auditable, individually bounded, and collectively responsible for 40 to 60 percent of the real total cost of ownership. They simply do not appear in the legacy software TCO template because that template was built for a system that did not have evals, prompt registries, or model upgrades. This piece names the seven lines, sizes each, and shows how to budget them upfront.
This is a spoke under the AI project economics manifesto, which argues that AI projects need a new economics framework, not the feature-cost framework inherited from 2018 software. The seven TCO lines below are the practical residue of that argument: the specific budget categories a CFO needs to add to a 2026 RFP template before a single proposal arrives.
Why CFOs miss these lines
The 2018 software TCO template (engineering build, infrastructure, support contingency, license fees) was sufficient for systems whose behavior was deterministic and whose dependencies were stable. AI projects are neither. The system’s behavior shifts under model upgrades. Its inputs vary in distribution. Its quality is enforced by tests that themselves cost money to build and maintain. None of these lines existed as a discipline in 2018, so none of them are in the budget templates most finance teams are working from in 2026.
The result is a predictable arc. The CFO approves a $X budget. Six to eight months in, the project surfaces a series of “scope items” that were not in the original SOW. The CFO sees a 25 to 40 percent overrun and either kills the project or funds the overrun while losing trust in the team. In nearly every case the overrun decomposes cleanly into the seven lines below. The lines were always going to be spent. The budget did not name them.
What follows: the seven lines, each with a typical dollar range for a mid-tier project ($250K to $750K build), a one-paragraph explanation of why finance misses it, and the specific paperwork move that surfaces it upfront. As the hidden cost of AI evals piece argues separately, eval-related lines are the largest single category, but the seven below extend beyond evals to the operational substrate the system runs on.
Line 1: Eval test set construction
What it is. Curating 200 to 2000 representative inputs against the buyer’s actual workload distribution, with rubric-graded or ground-truth outputs, with domain experts in the annotation loop, with a versioning policy so the test set evolves rather than freezes.
Typical $ range. $20K to $60K on a $250K to $750K build. Domain-expert labor at $150 to $300 per hour, 80 to 200 hours of curation and review.
Why CFOs miss it. It looks like data work, which finance assumes is the buyer’s responsibility, or it looks like QA, which finance assumes is rolled into engineering. It is neither. It is the foundational artifact against which every subsequent engineering decision will be evaluated, and outsourcing it to scale-labelers without senior review reliably produces test sets that pass the agency’s eyeball check and fail to surface the failure modes production hits.
Budget upfront. Add a “test set construction” line to the budget template, owned by a named domain expert (internal or contracted), sized at 8 to 12 percent of build cost.
Line 2: Model-upgrade re-evaluation
What it is. Re-running the full eval suite when a frontier model provider (Anthropic, OpenAI, Google) ships a non-trivial upgrade. Triaging the typically 5 to 15 percent of test cases that shift. Adjusting prompts and retrieval to the new model’s behavior. Re-locking the threshold. Two to four engineering weeks per upgrade.
Typical $ range. $30K to $80K annualized on a $250K to $750K build. Three to five upgrades per year, two to four weeks each, senior engineering rates.
Why CFOs miss it. The 2018 budget assumed the runtime substrate (database, web framework) was stable across the support window. Frontier model providers ship breaking-quality upgrades multiple times per year, and the buyer’s choice is to either pay the re-eval cost or run a model that is no longer best-of-breed. There is no “freeze the model” path that does not eventually become a strategic liability.
Budget upfront. Add a model-upgrade re-eval clause to the maintenance retainer with a named SLA (typically 14 days from major release to re-eval report) and a re-eval reserve in the year-one budget. See long-term maintenance costs of AI systems for the cumulative annual decomposition.
Line 3: Regression triage time
What it is. Engineering judgment on red eval runs. Reading reasoning traces. Comparing against the last green run. Hypothesizing the cause (prompt change, retrieval change, model change, content drift). Validating the hypothesis with targeted tests. Deciding whether to fix forward, roll back, or accept the regression as a deliberate trade-off.
Typical $ range. $40K to $100K per year on a $250K to $750K build. One to two days per week of senior engineering time on a serious project, billed at $200 to $300 per hour blended.
Why CFOs miss it. It does not look like work in progress reports; there is no feature shipping at the end of it. It looks like overhead. It is not overhead; it is the discipline that prevents the system from silently regressing while engineers ship new features against an unmonitored quality bar.
Budget upfront. Add a “regression triage” line to the headcount plan, sized at 10 to 14 percent of build cost annualized, with a named triage owner and a 48-hour SLA on red runs.
Line 4: Inference cost variance
What it is. Production token costs that vary unpredictably with usage growth, prompt length drift, retrieval-context bloat, and model-tier upgrades. The midpoint estimate is typically wrong by 2x in either direction over a 12-month window.
Typical $ range. $20K to $200K per year on a mid-tier project, depending on usage scale. The variance band is wider than the central estimate, which is the thing finance is least equipped to handle.
Why CFOs miss it. Finance budgets infrastructure as a fixed monthly run rate. Inference does not behave that way. A successful product will see usage grow 3 to 10x year-on-year; an unsuccessful product will see it stagnate or shrink. The same product can see per-unit inference cost drop 50 percent over 12 months as token prices fall, or rise 40 percent as the team adopts a higher-tier model for harder tasks.
Budget upfront. Budget inference as a band ($X to $Y), not a point estimate, with a quarterly review trigger that re-bases the run rate. Tie the band to a named usage projection and a named model tier; revise both quarterly.
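A minimal sketch of what budgeting a band with a quarterly trigger could look like. The function names, the $4,000 baseline, and the multiplier ranges are illustrative assumptions, not figures from any specific project; the multipliers apply to the annualized run rate, which is a simplification.

```python
def inference_band(baseline_monthly, usage_growth, unit_cost_change):
    """Return a (low, high) annualized run-rate band in dollars.

    baseline_monthly: today's monthly inference spend (hypothetical input).
    usage_growth: (low, high) multipliers on request volume.
    unit_cost_change: (low, high) multipliers on cost per request.
    """
    annual = baseline_monthly * 12
    low = annual * usage_growth[0] * unit_cost_change[0]
    high = annual * usage_growth[1] * unit_cost_change[1]
    return round(low), round(high)


def needs_rebase(actual_monthly, band):
    """Quarterly trigger: True when the actual run rate has left the band."""
    low, high = band
    return not (low <= actual_monthly * 12 <= high)


# Illustrative assumptions: $4,000/month today, usage flat to 3x,
# unit cost anywhere from 50 percent lower to 40 percent higher.
band = inference_band(4_000, usage_growth=(1.0, 3.0), unit_cost_change=(0.5, 1.4))
print(band)                        # (24000, 201600)
print(needs_rebase(18_000, band))  # True: $216K annualized is above the band
```

The point of the second function is that the review trigger is mechanical: a run rate outside the band forces a re-base rather than a surprise at year end.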
Line 5: Prompt registry maintenance
What it is. A versioned, tested, observed catalog of every prompt the system uses in production. Each prompt has an ID, a version, a linked test set, a deployment record, and an audit trail. Maintenance is the engineering work of keeping the registry trustworthy: deprecating dead prompts, refactoring shared fragments, updating tests when prompts evolve, retiring prompts that fail re-eval.
Typical $ range. $15K to $40K per year on a project with 30 to 80 production prompts. Senior engineering time, intermittent rather than continuous.
Why CFOs miss it. The 2018 budget had no analog. Prompts feel like configuration, which finance assumes is free, but they behave like code that determines system quality. A registry that drifts becomes a security and quality liability: prompts get tweaked in production without audit, regressions appear without attribution, and the team loses the ability to roll back to a known-good state.
Budget upfront. Add a “prompt registry” line under platform engineering, owned by an eval engineer or platform engineer, sized at 4 to 6 percent of build cost annualized.
Line 6: Observability storage costs
What it is. The infrastructure cost of storing the trace data, eval run history, prompt-and-completion logs, and metrics necessary to debug and improve a production AI system. Cardinality is high (per-request traces with reasoning steps), retention windows matter (regression debugging needs 90 to 180 days of history), and the storage substrate is non-trivial (vector indexes, columnar trace stores, long-term blob retention).
Typical $ range. $5K to $30K per year on a mid-tier project, scaling with traffic and retention policy. Storage rates are not the bottleneck; the bottleneck is unbounded retention without a tiering policy.
Why CFOs miss it. Standard application logging in 2018 was megabytes per month. AI observability is gigabytes to terabytes per month. The instinct to “log everything for debugging” produces a quietly compounding storage bill that nobody looks at until it is the second or third largest infra line. The day-one observability stack piece covers what to install; the storage cost is the bill that comes 60 days later.
Budget upfront. Add an “observability storage” line under infrastructure with a named retention policy (90 days hot, 180 days warm, 12 months cold) and a quarterly review of cardinality and volume against the original assumption.
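A rough sketch of how the retention policy translates into a steady-state bill. The per-GB prices are hypothetical placeholders, and the tiers are read here as age boundaries (hot until day 90, warm until day 180, cold until day 365), which is one possible interpretation of the policy above.

```python
# Assumed retention tiers, read as age boundaries in days:
# hot for days 0-90, warm for days 90-180, cold until day 365.
TIER_DAYS = {"hot": 90, "warm": 90, "cold": 185}

# Hypothetical prices in dollars per GB-month; substitute real vendor rates.
PRICE_PER_GB_MONTH = {"hot": 0.10, "warm": 0.03, "cold": 0.005}


def steady_state_monthly_cost(daily_ingest_gb):
    """Monthly storage bill once every tier has filled up."""
    total = 0.0
    for tier, days in TIER_DAYS.items():
        stored_gb = daily_ingest_gb * days  # volume resident in this tier
        total += stored_gb * PRICE_PER_GB_MONTH[tier]
    return round(total, 2)


def unbounded_monthly_cost(daily_ingest_gb, days_elapsed):
    """Same ingest with no tiering: everything stays in hot storage."""
    return round(daily_ingest_gb * days_elapsed * PRICE_PER_GB_MONTH["hot"], 2)


tiered = steady_state_monthly_cost(20)      # 20 GB/day is an assumed volume
untiered = unbounded_monthly_cost(20, 365)  # one year of "log everything"
print(tiered, untiered)
```

Under these assumed prices the tiered policy costs roughly a third of keeping a year of traces hot, and the untiered figure keeps growing with every additional day retained. The quarterly review compares the actual daily ingest against the number used here.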
Line 7: Post-launch eval retainer
What it is. A monthly retainer that covers ongoing eval suite maintenance, model-upgrade re-evals, regression remediation, and eval-bar progression as named activities rather than change orders.
Typical $ range. 25 to 40 percent of build cost annualized. On a $500K build, $125K to $200K per year. The retainer paradox piece argues this should be priced against eval outcomes rather than monthly hours, but the order of magnitude is the same either way.
Why CFOs miss it. Legacy software support contracts are 12 to 18 percent of build cost. AI maintenance retainers run 25 to 40 percent because the work is fundamentally different: it is not bug fixes against a frozen system, it is continuous quality enforcement against a system whose substrate (the model) is shifting underneath it.
Budget upfront. Add the retainer to the SOW from kickoff, with eval-named clauses (test set updates, harness maintenance, model-upgrade re-eval, regression remediation, eval-bar progression). Sized at 25 to 40 percent of build cost annualized for the first 24 months.
How to budget the seven lines upfront
Five practical moves.
One. Add all seven lines to the AI project budget template. Not as comments. As named line items with named owners and dollar ranges.
Two. Require the agency or internal team to produce a 24-month TCO projection at the proposal stage, decomposed into the seven lines. Reject proposals that show a single “engineering” line and refuse to decompose.
Three. Budget inference and observability storage as bands, not point estimates. Tie the bands to named usage and retention assumptions. Review quarterly.
Four. Make the maintenance retainer a kickoff artifact, not a post-launch negotiation. The retainer’s terms (what is included, what is excluded, how re-evals are triggered, what the SLA is on regressions) are easier to negotiate before the team has the leverage of being mid-project.
Five. Run a TCO review at month nine of the build (three months before launch) to re-base the seven lines against actual usage, suite size, and infrastructure shape. The estimates produced at proposal stage will be 20 to 40 percent off in one direction or another. The month-nine review prevents the year-two budget from inheriting bad estimates.
The cost of doing all five is approximately the cost of the additional finance and project-management time to track them, call it 1 to 2 percent of build cost. The cost of not doing them is the recurring CFO experience of approving a $X budget and watching it become $1.4X without ever finding the missing $0.4X explained by anything other than “the work was bigger than we thought.” It was not bigger. It was the seven lines.
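One way the template from move one could be represented, as a sketch. The percentages and dollar ranges are the ones quoted in the seven sections above; the owner roles, the `BudgetLine` class, and the $500K build are illustrative assumptions. The lines are listed rather than summed because the retainer (line 7) is described as covering re-evals and regression remediation, so a naive total would risk double counting.

```python
from dataclasses import dataclass


@dataclass
class BudgetLine:
    name: str
    owner: str              # placeholder role; a real template carries a name
    low: float              # dollars, or a fraction of build cost
    high: float
    pct_of_build: bool = False
    recurring: bool = True  # annualized if True, one-time if False

    def dollar_range(self, build_cost):
        """Resolve the line to a (low, high) dollar range."""
        if self.pct_of_build:
            return round(self.low * build_cost), round(self.high * build_cost)
        return round(self.low), round(self.high)


TEMPLATE = [
    BudgetLine("Eval test set construction", "domain expert", 0.08, 0.12,
               pct_of_build=True, recurring=False),
    BudgetLine("Model-upgrade re-evaluation", "eval engineer", 30_000, 80_000),
    BudgetLine("Regression triage", "triage owner", 0.10, 0.14, pct_of_build=True),
    BudgetLine("Inference cost", "platform engineer", 20_000, 200_000),
    BudgetLine("Prompt registry maintenance", "platform engineer", 0.04, 0.06,
               pct_of_build=True),
    BudgetLine("Observability storage", "platform engineer", 5_000, 30_000),
    BudgetLine("Post-launch eval retainer", "agency lead", 0.25, 0.40,
               pct_of_build=True),
]

for line in TEMPLATE:
    low, high = line.dollar_range(500_000)
    cadence = "per year" if line.recurring else "one-time"
    print(f"{line.name:32} {line.owner:18} ${low:>8,} to ${high:>8,} {cadence}")
```

A proposal that cannot fill in this table at the proposal stage is the single-“engineering”-line proposal that move two says to reject.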
Frequently asked questions
Why doesn’t the standard TCO template capture these seven lines?
Because the template predates them. Standard enterprise software TCO templates were calibrated against systems where behavior was deterministic, the runtime substrate was stable, and quality assurance was a one-time pre-launch activity. AI projects break all three assumptions: behavior shifts under model upgrades, the substrate moves three to five times per year, and quality assurance is a continuous activity that decomposes into eval engineering. The seven lines are the categories the template needs added to be 2026-correct.
Are these seven lines additive on top of the build budget or part of it?
Three of the seven (test set construction, regression triage, prompt registry) are part of the build budget; they are work that starts during the engagement. Four (model-upgrade re-eval, inference, observability storage, post-launch retainer) are post-launch or annualized lines that recur after the build ends. The TCO is build cost plus annualized recurring lines, summed over a 24- to 36-month planning horizon.
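The TCO definition in this answer reduces to one line of arithmetic. A small sketch, with hypothetical recurring figures chosen only to show the mechanics:

```python
def tco(build_cost, recurring_annual, horizon_months):
    """Build cost plus annualized recurring lines over the planning horizon."""
    years = horizon_months / 12
    return build_cost + sum(recurring_annual.values()) * years


# Hypothetical annual figures, not estimates for any real project.
recurring = {
    "model_upgrade_re_eval": 50_000,
    "inference": 60_000,
    "observability_storage": 15_000,
}

tco_24 = tco(500_000, recurring, 24)
tco_36 = tco(500_000, recurring, 36)
print(tco_24, tco_36)  # 750000.0 875000.0
```

Moving the horizon from 24 to 36 months changes the answer materially, which is why the horizon belongs in the proposal alongside the line items.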
How big is the total miss for a typical CFO?
On a $500K build, the seven lines sum to roughly $200K to $400K of additional spend over 24 months that legacy templates do not capture. The variance is wide because inference cost and observability storage scale with usage, and the eval-related lines scale with project ambition and frontier model release cadence.
Which of the seven is the biggest dollar item?
Post-launch eval retainer (line 7) is typically the largest single line at 25 to 40 percent of build cost annualized. Across a 24-month window, the retainer adds up to 50 to 80 percent of the original build cost. Regression triage (line 3) is the second largest at 10 to 14 percent annualized.
Can a project skip the prompt registry to save money?
Possible at small scale (under 10 prompts) but indefensible at any meaningful scale. Without a registry, the team loses the ability to attribute regressions, audit prompt changes, or roll back to known-good states. The cost of skipping it is paid later: triage cost (line 3) goes up because root cause analysis takes longer without versioned prompts.
How does inference cost variance work in practice?
A team launches with an estimated $4K per month inference budget. Six months in, usage has tripled and a higher-tier model is used for harder tasks; the bill is $18K per month. Three months later, token prices have dropped 40 percent and a routing layer has cut average prompt length; the bill is $11K per month. The point estimate at launch was low by 4.5x at month six and still low by 2.75x at month nine, with a $7K per month swing inside a single quarter. Budget the band, not the point.
What if a CFO insists on a single fixed-price number?
The defensible answer is to fix-price the build (lines 1, 3 in part, plus the agency’s normal scope), index-price the recurring lines (lines 2, 4, 5, 6, 7) to named external inputs (frontier model releases, usage metrics, retention policy), and review quarterly. Fixing all seven into a single number forces the agency to either pad heavily or under-price and surface the gap as scope creep; neither outcome serves the CFO.
How does this connect to the AI project economics manifesto?
The seven lines are the operational decomposition of the manifesto’s argument that AI projects need an evaluation-cost framing rather than a feature-cost framing. The manifesto sets the principle; the seven lines are the budget categories it implies. CFOs who believe the principle but do not name the seven lines on the budget template end up with the same overrun as CFOs who never read the manifesto.
How does this differ from the eval-cost piece?
The hidden cost of AI evals decomposes the 30 to 40 percent of project cost that goes to eval engineering specifically. The seven lines here extend beyond eval engineering to the broader operational substrate (inference variance, observability storage, retainer structure, prompt registry). Roughly half the seven lines are eval-adjacent; the other half are operational. Both decompositions are needed for a complete TCO picture.
Key takeaways
- AI projects have seven recurring TCO lines that legacy software templates do not capture: test set construction, model-upgrade re-evaluation, regression triage, inference cost variance, prompt registry maintenance, observability storage, and post-launch eval retainer.
- These seven lines collectively run 40 to 60 percent of true 24-month TCO on a mid-tier ($250K to $750K) project, money that gets spent regardless and surfaces as scope creep when it is not budgeted upfront.
- Three of the seven (test set, regression triage, prompt registry) are during-build lines; four (re-eval, inference, observability, retainer) are post-launch or annualized lines. Both categories need to be in the proposal stage TCO projection.
- The fix is paperwork: add the seven lines to the budget template, require decomposed proposals, budget volatile lines as bands, make the retainer a kickoff artifact, run a month-nine re-base review.
- The seven lines are the operational residue of the manifesto’s economics framing. CFOs who name them upfront recover the trust that gets lost when month-eight scope items appear without explanation.
The seven lines are not exotic. They are not novel. They are not the agency trying to charge for something invented. They are the budget categories a 2026 production AI system runs on, named in advance instead of named after the fact. Naming them upfront is the cheapest single thing a CFO can do to keep an AI project budget honest.
Arthur Wandzel