The real cost of an AI project decomposes cleanly into five named axes: engineering hours, inference and infra, eval discipline, observability and ops, and post-launch maintenance. Any project budget that cannot point to all five is structurally indefensible to a finance team that knows what to ask. The 2018 software template (engineering plus infra plus a small contingency) collapses two of these axes into “engineering” and ignores two others entirely. That is why AI project budgets surprise CFOs between months four and eight.
This piece names the five axes, gives a defensible range for each, and tells you what a finance team should be able to verify before approving the line. It is a spoke under the AI project economics manifesto, which argues for the budget framework this decomposition implements.
Why a 5-axis decomposition
A finance team’s job on an AI project is to verify that every line corresponds to work that will happen and that nothing has been rolled into a category big enough to hide it. The 2018 template fails this job because three of the five real cost axes are not on it: eval discipline is folded into “engineering,” observability is rounded into “infra,” and post-launch maintenance is sized as a small percentage of build with no SLA attached.
Naming the five axes separately gives finance four things the legacy template cannot: a bottoms-up sanity check (each axis has a defensible range), an audit trail (each axis has a verifiable artifact), a change-order defense (a vendor asking for more money in month seven has to point at which axis), and a multi-year P&L line that fits AI’s actual cost curve.
Axis 1: Engineering hours (senior plus agent leverage)
What it covers: hours spent designing and writing the application code. Backend agent runtime, retrieval, tool integrations, the user-facing surface, deployment. The work that produces the system the user interacts with.
How to estimate it. Bottoms-up by component, with two 2026-specific adjustments. First: seniority mix is heavier than on legacy software. AI engineering judgment lives mostly in senior heads: model abstractions, retrieval design, prompt design, and failure-mode taxonomies are work that is structurally hard to delegate. Plan 60–75 percent senior or staff hours, not the 30–40 percent typical of CRUD. Second: agent leverage is real. Coding agents (Claude Code, Cursor agent modes, Codex CLI) compress mid-level engineering work, but only when paired with senior judgment. The honest planning multiplier is 1.5x–2.5x developer throughput on well-scoped backend work, smaller on architectural decisions.
Defensible range: typically 35–45 percent of total project cost on a 12-month build. Below 30 percent and the project has under-scoped real engineering work; above 50 percent and the project has rolled eval discipline or observability into engineering, hiding axes three and four.
What finance should verify:
- Hours-by-grade table. How many senior, mid, junior hours, by component. The grade mix should look like an AI project, not a 2018 CRUD project.
- Agent-leverage assumptions. Does the budget assume any throughput multiplier from agentic coding tools, and is it documented? “We did not bake in any agent leverage” is a fine answer in 2026; “we assumed 5x and that’s why this is cheap” is a budget that will miss.
- A separate line for ML-specific architecture work. If the line does not exist, ML decisions will be made in passing during application engineering, which is how systems end up with a non-replaceable model abstraction halfway through.
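A minimal sketch of what an hours-by-grade check could look like, using the planning bands above. The component names, hour counts, and the `Component` class are hypothetical, and the sketch assumes agent leverage compresses mid-level hours only, which is one reading of the estimate above rather than a fixed rule.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    senior: float                  # baseline senior/staff hours
    mid: float                     # baseline mid-level hours, before agent leverage
    junior: float                  # baseline junior hours
    agent_multiplier: float = 1.0  # 1.0 means no agent leverage is assumed

    def budgeted_hours(self) -> float:
        # Assumption: the throughput multiplier applies to mid-level hours only.
        return self.senior + self.mid / self.agent_multiplier + self.junior


def senior_share(components: list[Component]) -> float:
    """Senior hours as a fraction of all budgeted hours."""
    total = sum(c.budgeted_hours() for c in components)
    return sum(c.senior for c in components) / total


def review_flags(components: list[Component]) -> list[str]:
    """Return the questions finance might raise about the hours table."""
    flags = []
    share = senior_share(components)
    if not 0.60 <= share <= 0.75:
        flags.append(f"senior share {share:.0%} is outside the 60-75% planning band")
    for c in components:
        if c.agent_multiplier > 2.5:
            flags.append(f"{c.name}: assumed {c.agent_multiplier}x leverage exceeds 2.5x")
    return flags


# Hypothetical plan, for illustration only.
plan = [
    Component("agent runtime", senior=600, mid=300, junior=60, agent_multiplier=2.0),
    Component("user-facing surface", senior=200, mid=200, junior=100),
]
print(round(senior_share(plan), 3))
print(review_flags(plan))
```

The point of the structure is that the leverage assumption is a visible field per component rather than a discount buried in the total.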
Axis 2: Inference and infra (variable, pass-through)
What it covers: model API spend, embedding spend, vector database hosting, application compute, queueing, storage. The variable cost of running the system at scale.
How to estimate it. Token-by-workload modeling. Take the expected query mix at year-one volume, multiply by input + output + reasoning tokens per query, multiply by per-token model cost, and add 20–40 percent for retries and failed tool calls. Add embedding spend, vector database TCO, and standard application infra. Stress-test against three workload scenarios (base, 2x, 5x), because adoption volume routinely outruns initial estimates by an order of magnitude.
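The token model can be sketched in a few lines. Everything numeric here is a placeholder: the prices are not real provider prices, the 30 percent retry overhead is simply the midpoint of the 20–40 percent range, and billing reasoning tokens at the output rate is an assumption to check against the provider's pricing page. Embedding spend, vector database TCO, and application infra are left out and would be added on top.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    queries_per_year: int
    input_tokens: int        # per query
    output_tokens: int       # per query
    reasoning_tokens: int    # per query
    usd_per_m_input: float   # placeholder price per million input tokens
    usd_per_m_output: float  # placeholder price per million output tokens

    def cost_per_query(self) -> float:
        # Assumption: reasoning tokens are billed at the output-token rate.
        billed_as_output = self.output_tokens + self.reasoning_tokens
        return (self.input_tokens * self.usd_per_m_input
                + billed_as_output * self.usd_per_m_output) / 1_000_000


def annual_model_spend(workloads, retry_overhead=0.30, volume_multiplier=1.0):
    """Model API spend for one year, including retries and failed tool calls."""
    base = sum(w.queries_per_year * w.cost_per_query() for w in workloads)
    return base * volume_multiplier * (1 + retry_overhead)


# Hypothetical query mix with a single workload.
mix = [Workload("support assistant", 1_000_000, 2_000, 500, 1_000, 3.0, 15.0)]

for label, multiplier in {"base": 1.0, "2x": 2.0, "5x": 5.0}.items():
    print(label, round(annual_model_spend(mix, volume_multiplier=multiplier)))
```

With these placeholder inputs the base scenario comes to about $37,050 a year and the 5x scenario to about $185,250, which is the spread the budget has to survive.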
Defensible range: 8–18 percent of build year cost on most enterprise AI applications, climbing to 25–40 percent on high-throughput consumer surfaces or agentic workloads with heavy reasoning tokens. Reasoning models (Claude with extended thinking, OpenAI o-series, Gemini 2.5 thinking) cost substantially more per query than non-reasoning models; if the architecture relies on reasoning models for the hot path, model that explicitly.
What finance should verify:
- The token model itself. Spreadsheet, not deck. Inputs visible. Per-token cost referenced to actual provider pricing pages.
- Pass-through structure. Inference is billed direct from model provider to buyer, not marked up by an agency holding the API keys. We make the structural argument in the AI agency manifesto and the contracting argument in the agency tax piece; a flat-fee arrangement that buries inference in agency margin is the line item most likely to surprise.
- Cost-per-completion target on the dashboard. The number that engineering optimizes against and the CFO sees. If they are different numbers, the project will optimize against the wrong one.
Axis 3: Eval discipline (often 30 percent)
What it covers: test set construction, eval harness build, regression triage, model-upgrade re-eval. The work that produces the system’s correctness, separately from the work that produces its features.
How to estimate it. Four sub-budgets: test set construction (200–2000 inputs with rubric-graded outputs, 8–12 percent), eval harness build (5–8 percent), regression triage (10–14 percent), model-upgrade re-eval (6–10 percent annualized for the three-to-five frontier upgrades a 12-month engagement sits through).
Total defensible range: 28–42 percent of total project cost. The legacy template’s largest blind spot. Corroborated by public eval-tooling cost data from Promptfoo, Inspect, OpenAI’s Evals framework, and Anthropic’s eval engineering posts. Detailed line decomposition is in the hidden cost of AI evals piece.
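As a rough sketch, the four sub-budgets can be rolled up and converted to dollars for a given project total. The $1,000,000 total is hypothetical. Note that the sub-ranges as listed sum to 29–44 percent, slightly wider than the 28–42 percent headline band; the sketch reports the sum as computed rather than reconciling the two.

```python
# Sub-budget ranges as percent of total project cost, taken from the text above.
EVAL_SUB_BUDGETS_PCT = {
    "test set construction": (8, 12),
    "eval harness build": (5, 8),
    "regression triage": (10, 14),
    "model-upgrade re-eval": (6, 10),
}


def eval_band(sub_budgets):
    """Sum the low ends and the high ends of the sub-budget ranges."""
    low = sum(lo for lo, _ in sub_budgets.values())
    high = sum(hi for _, hi in sub_budgets.values())
    return low, high


def eval_dollars(total_project_cost, sub_budgets):
    """Convert each percentage range into a dollar range."""
    return {
        name: (total_project_cost * lo / 100, total_project_cost * hi / 100)
        for name, (lo, hi) in sub_budgets.items()
    }


print(eval_band(EVAL_SUB_BUDGETS_PCT))
for name, (lo, hi) in eval_dollars(1_000_000, EVAL_SUB_BUDGETS_PCT).items():
    print(f"{name}: ${lo:,.0f} to ${hi:,.0f}")
```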
What finance should verify:
- A named eval engineering owner on the project, distinct from feature engineering. If “engineering will do evals” is the answer, the project has structurally absorbed eval discipline into feature engineering and will under-deliver both.
- The eval suite itself, with read access from kickoff. Not “delivered at the end.”
- Threshold-locking process. Who chooses the threshold, against what business outcome, and how regressions get re-locked. This is the most direct competence signal an AI engineering team produces.
Axis 4: Observability and ops
What it covers: trace storage, replay tooling, online eval scoring, dashboards, on-call rotation, incident response. The infrastructure that lets the team know whether the system is still working in production.
How to estimate it. A one-time build line (traces, dashboards, online evals) plus a recurring percentage of inference spend (15–25 percent for trace storage, replay, online scoring). On-call sized against the SLA: a 24/7 rotation runs ~1.5 FTE-equivalents annualized; business-hours, ~0.5 FTE.
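The same estimate as a small calculation, for illustration. The build line, inference spend, and loaded FTE cost are hypothetical inputs; only the 15–25 percent recurring band and the FTE-equivalents per coverage level come from the paragraph above.

```python
# FTE-equivalents per on-call coverage level, from the estimate above.
ON_CALL_FTE = {"24x7": 1.5, "business_hours": 0.5}
# Recurring trace storage, replay, and online scoring as percent of inference spend.
RECURRING_PCT_OF_INFERENCE = (15, 25)


def observability_year_cost(build_line, annual_inference_spend,
                            loaded_fte_cost, coverage):
    """Return a (low, high) build-year cost for observability and ops."""
    on_call = ON_CALL_FTE[coverage] * loaded_fte_cost
    low_pct, high_pct = RECURRING_PCT_OF_INFERENCE
    low = build_line + annual_inference_spend * low_pct / 100 + on_call
    high = build_line + annual_inference_spend * high_pct / 100 + on_call
    return low, high


# Hypothetical inputs: $60k one-time build, $100k inference, $200k loaded FTE.
print(observability_year_cost(60_000, 100_000, 200_000, "24x7"))
print(observability_year_cost(60_000, 100_000, 200_000, "business_hours"))
```

With these inputs the on-call rotation dominates the line, which is why the SLA choice belongs in the budget conversation and not in an ops footnote.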
Defensible range: 8–15 percent of total project cost when the project genuinely owns its operations, lower if the buyer’s existing SRE team absorbs incident response. Rolling observability into “infra” is the most common way this axis disappears from a budget. It does not disappear from reality; it just stops being governed.
What finance should verify:
- Observability shipping in week one of build, not month nine. The budget for it lives in the build phase, not in post-launch.
- Trace retention policy. How long are reasoning traces stored, who can query them, what is the cost. AI traces are larger than legacy logs and the storage cost is non-trivial at scale.
- Dashboards that show eval score and unit cost on the same screen. The two metrics are co-optimized; if they live in different tools they get optimized against different targets.
Axis 5: Post-launch maintenance retainer
What it covers: ongoing eval suite maintenance, model-upgrade re-evals, regression remediation, prompt and retrieval drift fixes, small feature iterations, on-call coverage. The work that keeps the system at its eval threshold after launch.
How to estimate it. Percentage of build cost, annualized. Working ranges in 2026: 25–30 percent for low-regression systems (stable workload, infrequent content updates), 30–40 percent for typical production systems, 40–60 percent for high-stakes or rapidly evolving systems. A 2018 template that sized “support” at 10 percent of build under-budgets AI maintenance by more than half, even against the low end of these ranges.
Defensible range: 25–40 percent of build cost, annualized, multi-year. The retainer number on year one is the most-likely-correct number for years two and three; AI maintenance does not decay the way legacy software maintenance does.
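A sketch of the multi-year line under these ranges. The $1,000,000 build cost and the profile labels are hypothetical; the flat schedule encodes the claim above that the year-one retainer is the best estimate for later years, and the 10 percent figure is the legacy "support" sizing used for comparison.

```python
# Annualized retainer bands as percent of build cost, from the working ranges above.
RETAINER_BANDS_PCT = {
    "low_regression": (25, 30),
    "typical": (30, 40),
    "high_stakes": (40, 60),
}
LEGACY_SUPPORT_PCT = 10  # the 2018-template "support" line


def retainer_schedule(build_cost, profile, years=3):
    """Return one (low, high) dollar range per year; flat, with no decay."""
    low_pct, high_pct = RETAINER_BANDS_PCT[profile]
    yearly = (build_cost * low_pct / 100, build_cost * high_pct / 100)
    return [yearly] * years


def legacy_shortfall_per_year(build_cost, profile):
    """Minimum annual gap between the legacy line and the band's low end."""
    low_pct, _ = RETAINER_BANDS_PCT[profile]
    return build_cost * (low_pct - LEGACY_SUPPORT_PCT) / 100


schedule = retainer_schedule(1_000_000, "typical")
print(schedule)
print(legacy_shortfall_per_year(1_000_000, "typical"))
```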
What finance should verify:
- A retainer contract with a named SLA: eval suite re-runs weekly, regressions triaged within 48 hours, model-upgrade re-evals within two weeks of a major release.
- Usage transparency. The retainer is billed monthly with hours broken down by activity, not as a flat insurance premium where the agency keeps the difference.
- A multi-year P&L line, not a quarterly contingency. The CFO who modeled the project as build-then-decay is wrong; the CFO who modeled it as build-plus-retainer is right.
What the five axes look like together
A representative 12-month enterprise AI project budget, decomposed across the five axes:
| Axis | Defensible range | What finance verifies |
|---|---|---|
| 1. Engineering hours | 35–45% | Hours-by-grade, agent-leverage assumptions, ML architecture line |
| 2. Inference and infra | 8–18% (build year) | Token model, pass-through structure, cost-per-completion |
| 3. Eval discipline | 28–42% | Named eval owner, eval-suite read access, threshold-locking process |
| 4. Observability and ops | 8–15% | Week-one observability, trace retention, eval+cost on one dashboard |
| 5. Post-launch maintenance | 25–40% of build, annualized | Retainer SLA, usage transparency, multi-year P&L line |
Two reads of the table. First, the percentages do not sum to 100 because axes 2 and 5 cross fiscal years differently from axes 1, 3, and 4. The build year shows 1+3+4 dominant; the steady-state run-rate shows 2+5 dominant. Confusing the two produces some of the most common budget surprises on AI work.
Second, the axes are partially substitutable but only within a narrow band. Cutting eval discipline (axis 3) below 25 percent is the single most reliable way to ship a system that fails its first board review. Cutting observability (axis 4) below 8 percent is the most reliable way to discover a regression six weeks after a customer hits it. Cutting maintenance (axis 5) below 25 percent is the most reliable way to watch eval scores drift downward unattended for a year.
The five axes are not optional categories. They are the actual shape of an AI project’s cost. Budgets that name them are auditable. Budgets that hide them are change-order machines.
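One way to make the table operational is a range check over a proposed budget. This is a sketch under stated assumptions: "build cost" is taken to be the sum of axes 1 through 4 in the build year, maintenance is checked separately as an annualized percentage of that sum, and both example budgets are invented for illustration.

```python
# Build-year ranges (percent of build-year total) from the table above.
BUILD_YEAR_RANGES_PCT = {
    "engineering": (35, 45),
    "inference_infra": (8, 18),
    "eval_discipline": (28, 42),
    "observability_ops": (8, 15),
}
# Maintenance is a separate multi-year line: percent of build cost, annualized.
MAINTENANCE_RANGE_PCT = (25, 40)


def classify(pct, band):
    low, high = band
    if pct < low:
        return "below"
    if pct > high:
        return "above"
    return "ok"


def audit_budget(build_year, annual_maintenance):
    """Flag each axis as below, ok, or above its defensible range."""
    total = sum(build_year.values())  # assumption: build cost = axes 1-4
    report = {
        axis: classify(100 * build_year.get(axis, 0) / total, band)
        for axis, band in BUILD_YEAR_RANGES_PCT.items()
    }
    report["maintenance"] = classify(100 * annual_maintenance / total,
                                     MAINTENANCE_RANGE_PCT)
    return report


five_axis = {"engineering": 400_000, "inference_infra": 120_000,
             "eval_discipline": 350_000, "observability_ops": 130_000}
legacy_style = {"engineering": 800_000, "inference_infra": 150_000,
                "eval_discipline": 0, "observability_ops": 50_000}

print(audit_budget(five_axis, annual_maintenance=300_000))
print(audit_budget(legacy_style, annual_maintenance=100_000))
```

The legacy-style budget fails on four of the five lines: engineering reads as too large precisely because eval and observability work have been folded into it.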
Frequently asked questions
How is this 5-axis decomposition different from a standard software project budget?
A standard software project budget has roughly two axes, engineering plus infra, with a small contingency. The 5-axis decomposition splits engineering into engineering plus eval discipline (because eval engineering is roughly 30 percent of project cost and structurally distinct from feature engineering) and splits infra into inference plus observability plus maintenance (because observability and maintenance are larger and longer-lived on AI systems). A standard 2-axis budget on AI work hides 35 to 50 percent of total cost in categories too coarse to audit.
What is a defensible range for each axis as a percentage of total cost?
In a 12-month build year: engineering 35 to 45 percent, inference and infra 8 to 18 percent, eval discipline 28 to 42 percent, observability and ops 8 to 15 percent, post-launch maintenance 25 to 40 percent of build cost annualized (which is a separate multi-year line, not part of the build-year sum). If any axis is materially outside the range, the budget has either hidden work or invented work.
Why is eval discipline a separate axis instead of part of engineering?
Because eval discipline is structurally distinct from feature engineering. Eval work (test set construction, harness build, regression triage, model-upgrade re-eval) uses different skills, runs on a different cadence, and has different deliverables than feature engineering. Folding it into engineering produces two failure modes: feature engineers under-prioritize eval work because their incentives reward shipped features, and finance cannot audit eval spend separately from feature spend. The result is consistent under-investment in correctness.
How should inference be billed: pass-through or fixed-fee?
Pass-through, billed direct from the model provider to the buyer, with zero agency markup. Inference is variable and must be visible to the buyer for unit-economics and pricing decisions. A flat fee that includes inference creates a hidden margin that grows when inference cost falls, which removes the agency’s incentive to optimize and removes the buyer’s visibility into their own variable cost line. The only legitimate exception is a small early pilot where the inference volume is low enough that the bookkeeping cost dominates.
Why is post-launch maintenance 25 to 40 percent of build cost when traditional software maintenance is 10 to 15 percent?
Because AI systems regress in ways legacy software does not. Models upgrade. User query distributions shift. Embeddings drift as content updates. The eval suite itself drifts and needs maintenance. Each is an engineering ticket that must be triaged, root-caused, and fixed against the eval bar. Sizing maintenance as a 10 percent rounding contingency is the single biggest reason AI projects produce cost surprises in their second year.
How does observability differ from logging in this framework?
Logging captures events; observability captures the reasoning trace. AI traces include tool calls, intermediate model outputs, retrieval hits, and reasoning tokens, and are substantially larger and richer than legacy application logs. Observability also includes online eval scoring, which runs the eval suite continuously against a sampled fraction of production traffic, and replay tooling that lets engineers re-run a real production trace through a new model or prompt. None of this is in scope for “logging” budget categories.
What artifacts should finance request to verify the budget on each axis?
Engineering: hours-by-grade table by component, agent-leverage assumptions documented. Inference: a token-by-workload spreadsheet referenced to provider pricing pages, plus a pass-through clause in the SOW. Eval discipline: a named eval owner, a draft eval suite or harness plan, a threshold-locking process document. Observability: a dashboard mockup showing eval and cost on one screen, a trace retention policy. Maintenance: a retainer contract with named SLAs, sized as percentage of build cost annualized.
What is the single most common axis mis-sizing finance teams should look for?
Eval discipline being absent or rolled into engineering. The line is most often hidden because the legacy budget template did not have a category for it, and because vendors used to selling against the legacy template do not surface it without prompting. Any AI project budget without a separately named eval discipline line in the 28 to 42 percent range is structurally indefensible.
Key takeaways
- The real cost of an AI project decomposes into five named axes: engineering, inference, eval discipline, observability, maintenance. Budgets that name fewer than five are hiding work.
- Eval discipline is the largest hidden axis, typically 28–42 percent of project cost. If your budget does not name it separately, the project is mispriced by roughly a third.
- Inference is a pass-through, billed direct from provider to buyer. A flat fee that includes inference is a hidden agency margin, not a buyer protection.
- Observability is a build-year line plus a recurring percentage of inference, sized at 15–25 percent of inference spend. Rolling it into “infra” is how it disappears.
- Post-launch maintenance is 25–40 percent of build cost annualized, multi-year, with named SLAs. AI maintenance does not decay the way legacy maintenance does.
- Each axis has a verifiable artifact: hours-by-grade, token model, eval owner, dashboards, retainer SLA. Finance teams that verify all five do not get surprised.
The five axes are not categories you choose. They are the actual shape of the project. Budgets that name them are defensible. Budgets that hide them are change-order schedules waiting to be drafted.
Arthur Wandzel