
Anatomy of a runaway AI project: 5 cost-side root causes

Runaway AI projects do not run away because the team got unlucky. They run away because the budget template they were priced against did not have line items for the five places AI work leaks money. Eval-debt absorbed into agency margin. Inference variance treated as a fixed number. Scope creep through small favors that rarely trigger a change order. Model-upgrade re-evaluations nobody put on the calendar. A post-launch retainer skipped to make the proposal cheaper. Each is individually fixable. Collectively, they are the dominant cause of AI cost overruns in 2026, and they all sit on the cost side, which means a contract can prevent them. This piece names the five and gives each a failure shape, a leading indicator, and the contract clause that closes the leak. It ends with a 90-day “is your project running away?” checklist.

This is a spoke under the AI project economics manifesto, which argues AI projects need an economics framework built around evaluation cost, not the legacy feature-cost framework. The five causes below are what happens when a 2018 budget template meets a 2026 AI project: the predictable residue of a category mismatch, present in some combination in roughly four out of five post-mortemed projects.

Why “runaway” is a budgeting category

The reflex when a project runs over is to look at execution: team velocity, PM rigor, buyer discipline. Real questions, but second-order. The first-order question is whether the budget template the project was priced against had categories for the work the project would inevitably do.

For AI projects in 2026, the answer is almost always no. The legacy template (engineering build, infrastructure, support contingency, license fees) has no row for eval engineering, model-upgrade re-evaluation, inference variance, or an observability retainer. The work happens anyway; it just gets paid for in places the budget did not name. Sometimes the agency absorbs it into margin until margin runs out. Sometimes the buyer pays it as scope-creep change orders that arrive in waves. Sometimes the project is killed three quarters in, when trust has already eroded.

What follows is five specific places the budget template fails. Each gets a failure shape, a leading indicator, and the contract clause that closes the gap.

Root cause 1: Eval-debt absorbed, not billed

Failure shape. The SOW prices the project on a feature list: “build the agent, integrate retrieval, ship orchestration.” “Eval” rarely appears as a billable line. The agency builds evals anyway and absorbs the cost into margin. By month four or five the team is spending 30 to 40 percent of capacity on eval work the contract does not pay for. Margin runs out. Eval engineering slows. Regressions ship. The buyer is told the issue is “model quality.” The actual issue is contract structure: eval engineering does not look like the engineering categories finance recognizes (it is not a feature, not infrastructure, not unit tests), so it gets buried in the “engineering” bucket without being separately sized.

Leading indicator. Track the ratio of eval to feature work in weekly notes for the first eight weeks. Eval below 25 percent of total time means the team is either skipping or absorbing; both runaway flags. Healthy engagements run eval at 30 to 40 percent from week one, with the line visible.
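A minimal sketch of that ratio check, assuming weekly hours are logged in two buckets. The `WeekLog` structure, its field names, and the "watch" label for the 25 to 30 percent gap are illustrative assumptions; the text only fixes the 25 percent flag and the 30 to 40 percent healthy range, and does not say how to read a share above 40 percent.

```python
from dataclasses import dataclass


@dataclass
class WeekLog:
    """Hours logged in one week (hypothetical structure for illustration)."""
    eval_hours: float
    feature_hours: float


def eval_share(weeks: list[WeekLog]) -> float:
    """Return eval hours as a fraction of all logged hours."""
    eval_total = sum(w.eval_hours for w in weeks)
    total = eval_total + sum(w.feature_hours for w in weeks)
    if total == 0:
        raise ValueError("no hours logged")
    return eval_total / total


def eval_flag(weeks: list[WeekLog]) -> str:
    """Classify the first eight weeks against the thresholds in the text."""
    share = eval_share(weeks[:8])
    if share < 0.25:
        return "runaway flag"  # team is skipping or absorbing eval work
    if share < 0.30:
        return "watch"         # assumption: the text leaves this gap undefined
    return "healthy"           # at or above the 30 percent floor
```

Feeding it the weekly notes for weeks one through eight gives a single label to put in front of finance, which is the point of keeping the line visible.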

Contract clause. Name eval engineering as a separately sized SOW line, no less than 30 percent of project value, with sub-lines for test set construction, harness build, regression triage, and model-upgrade re-eval. Tie 30 percent of contract value to eval-threshold milestones rather than feature acceptance. Buyer gets read access to the eval suite from kickoff, not at delivery.

Root cause 2: Inference-cost variance unbudgeted

Failure shape. The proposal lists inference as a single dollar amount, say $4,000 per month, from a back-of-envelope token estimate. Real production usage is shaped by three forces the estimate did not see: workload distribution shifts as actual users come online, model-mix changes as upgrades arrive, and retrieval-expansion silently grows the context window per call as RAG and tool use mature. By month seven the inference line runs at $14,000 per month and the CFO asks why a fixed budget item has more than tripled. Finance treated inference like a hosting line: stable, traffic-flexed. It is not. Inference is a usage-based cost on a substrate whose unit price moves, in a workload whose token consumption per request is non-stationary. As the inference cost analysis argues, inference is structurally the new database line.

Leading indicator. A flat inference forecast. If the budget shows the same dollar amount every month for 12 months, the forecast is wrong. Real forecasts have bands (low/expected/high) per quarter and a re-forecasting cadence. Absence of bands is the indicator.

Contract clause. Inference is a pass-through line, not agency margin: a budget band per quarter, a monthly reforecast obligation, and a hard threshold above which the buyer must be alerted within 48 hours. Token markups, per-call fees, and undisclosed model-routing margins are explicitly excluded.
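The band-and-threshold mechanics can be sketched roughly as follows. `QuarterBand`, `is_flat`, `check_month`, and the return labels are hypothetical names for illustration; the clause itself only specifies a quarterly band, a monthly reforecast, and a hard threshold with a 48-hour alert.

```python
from dataclasses import dataclass


@dataclass
class QuarterBand:
    """Monthly inference budget band for one quarter, in USD (illustrative)."""
    low: float
    expected: float
    high: float


def is_flat(monthly_forecast: list[float]) -> bool:
    """True when every month carries the same number: the leading indicator."""
    return len(set(monthly_forecast)) <= 1


def check_month(actual: float, band: QuarterBand, alert_threshold: float) -> str:
    """Compare one month's actual spend with the band and the hard threshold."""
    if actual > alert_threshold:
        return "alert buyer within 48 hours"
    if actual > band.high:
        return "above band: reforecast"
    if actual < band.low:
        return "below band: reforecast"
    return "inside band"
```

With an assumed band of 3,000/4,000/6,000 and an assumed threshold of 10,000, the $14,000 month from the failure shape would trip the alert rather than surface as a surprise to the CFO.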

Root cause 3: Scope creep through “small favors”

Failure shape. None of the requests are big enough to trigger a change order. The buyer asks for one extra column, one extra evaluator, one extra tool, one extra retrieval source. The PM says yes to each. After ten weeks the team has spent 18 to 22 percent of capacity on requests not in the SOW, and not one change order has been written. The project is structurally over budget but invisible to finance. By the time it surfaces, usually as a missed deadline two months later, the favors are too entangled with delivered scope to unwind. Change-order processes assume scope changes are discrete; small favors are continuous and invisible.

Leading indicator. Favor velocity. Count out-of-SOW requests per week that were not written up. Above two per week for four consecutive weeks is a structural runaway flag, regardless of how friendly the relationship feels.
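As a sketch, the favor-velocity rule is a streak counter over weekly tallies. The function name and the input format (one integer per week) are assumptions; the limit of two per week and the run of four weeks come from the text.

```python
def favor_flag(unwritten_per_week: list[int], limit: int = 2, run: int = 4) -> bool:
    """Return True if more than `limit` unwritten out-of-SOW requests
    were logged in each of `run` consecutive weeks."""
    streak = 0
    for count in unwritten_per_week:
        # A week at or below the limit resets the streak.
        streak = streak + 1 if count > limit else 0
        if streak >= run:
            return True
    return False
```

The reset matters: a single quiet week breaks the run, so the flag only fires on sustained velocity, not on one busy sprint.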

Contract clause. Add a structured small-changes allowance, say 40 hours per quarter, for sub-change-order requests, drawn down transparently and reported monthly. Beyond it, requests auto-convert to change orders or defer. This turns an invisible leak into a visible, governed line.
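One way the drawdown could work, sketched under assumptions: requests are handled in arrival order, and any request that does not fit in the remaining hours is routed to a change order or deferred. The `draw_down` name, the `(name, hours)` tuples, and the in-order policy are illustrative, not part of the clause.

```python
def draw_down(allowance_hours: float, requests: list[tuple[str, float]]):
    """Split requests into those covered by the allowance and those that
    overflow it. Returns (covered, overflow, remaining_hours)."""
    remaining = allowance_hours
    covered: list[str] = []
    overflow: list[str] = []
    for name, hours in requests:
        if hours <= remaining:
            covered.append(name)
            remaining -= hours
        else:
            # Does not fit: becomes a change order or is deferred.
            overflow.append(name)
    return covered, overflow, remaining
```

The monthly report is then just the three return values: what was absorbed, what converted, and how much of the quarter's allowance is left.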

Root cause 4: Model-upgrade re-evals not in budget

Failure shape. The project ships in month six and hits its eval thresholds. Two months later Anthropic, OpenAI, or Google releases a non-trivial model upgrade. The buyer wants the new model, partly for cost, partly for capability. Re-evaluation takes two to four engineering weeks: re-run the eval suite, triage a 5 to 15 percent regression rate, adjust prompts and retrieval, re-lock thresholds. None of this was in the SOW. The agency either eats the cost (eroding margin), delays the upgrade (eroding the buyer’s position), or writes a change order (eroding the relationship). The 2018 template assumes a stable runtime. Frontier model upgrades are frequent (three to five times per year), partially mandatory (older models get deprecated), and behavior-changing in ways that require structured re-evaluation.

Leading indicator. A 12-month engagement budget with no re-eval allowance. Ask during contract negotiation: “How many model-upgrade re-evals does this budget assume, at what cost?” If the answer is zero or “we’ll handle that as it comes up,” the project is already 8 to 16 weeks under-budgeted before kickoff.

Contract clause. A named re-eval allowance: two to four full re-evals per 12-month engagement, sized at two to four engineering weeks each, with explicit triggering conditions. Buyer chooses when to draw down. Unused allowance carries over or refunds. This single clause prevents the most predictable AI overrun there is.
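The sizing arithmetic behind the allowance is a product of two ranges. In this sketch the function names and the `weekly_rate` parameter are assumptions; the two-to-four counts and two-to-four week durations are the figures from the clause.

```python
def reeval_allowance_weeks(
    reevals: tuple[int, int] = (2, 4),
    weeks_each: tuple[int, int] = (2, 4),
) -> tuple[int, int]:
    """Return (min, max) engineering weeks for a 12-month allowance."""
    return reevals[0] * weeks_each[0], reevals[1] * weeks_each[1]


def allowance_cost(weeks: tuple[int, int], weekly_rate: float) -> tuple[float, float]:
    """Price the allowance at a hypothetical blended weekly engineering rate."""
    return weeks[0] * weekly_rate, weeks[1] * weekly_rate
```

With the default ranges the allowance spans 4 to 16 engineering weeks; the upper end matches the 16-week shortfall quoted for a budget that assumes zero re-evals.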

Root cause 5: Post-launch retainer skipped

Failure shape. The build wraps. The buyer sees a “post-launch eval retainer” line at $12,000 to $25,000 per month and asks if it is necessary. The agency, eager to close, defers it for “the first 90 days.” Ninety days pass. No regressions surface visibly because nobody is running the evals. The first model upgrade lands without re-evaluation. By month six the system has degraded measurably (accuracy down, hallucination rate up, latency tail expanded), but the team that built it is gone. A new build budget is requested to “fix” the system, typically two to three times the cost of the retainer that would have prevented the degradation. The retainer has no corresponding feature, so it is the easy thing to cut when the proposal needs to look cheaper. As the post-launch AI support analysis argues, the retainer is the cheapest insurance an AI project carries, and the most often skipped.

Leading indicator. Go-live with no retainer signed. If the system enters production without a contracted owner for evals, observability, prompt drift, and model upgrades, the runaway is already scheduled.

Contract clause. Make the retainer auto-trigger at launch unless waived in writing by the buyer’s executive sponsor. Default cost: 15 to 25 percent of inference spend plus a fixed eval-maintenance fee, billed monthly. Covers regression detection, model-upgrade re-eval triggering, prompt-registry maintenance, and observability ownership. Waiving requires a signed acknowledgment that the buyer has staffed an internal owner with named eval responsibility.
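A rough sketch of the default retainer formula from the clause. The `fixed_fee` amount is not given in the text, so it is a caller-supplied assumption here, as are the function and parameter names; only the 15 to 25 percent share is taken from the clause.

```python
def monthly_retainer(inference_spend: float, fixed_fee: float, share: float = 0.20) -> float:
    """Monthly retainer = share of inference spend + fixed eval-maintenance fee.

    `share` must fall in the 15 to 25 percent range named in the clause.
    """
    if not 0.15 <= share <= 0.25:
        raise ValueError("share must be between 0.15 and 0.25")
    if inference_spend < 0 or fixed_fee < 0:
        raise ValueError("amounts must be non-negative")
    return share * inference_spend + fixed_fee
```

Because one term scales with inference spend, the retainer rises with usage, which ties its reforecast to the same monthly cadence as the inference line.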

The 90-day “is your project running away?” checklist

The earliest a runaway is reliably detectable is 90 days. Before that, leading indicators are too noisy. After that, cost is increasingly entangled with delivered work and the unwind cost grows. Run this checklist at the 90-day mark. A project that fails three or more is either already running away or will be by month six.

  1. Eval engineering visibility. Is eval engineering a separately sized budget line, running at 30 to 40 percent of capacity, with weekly visibility into eval-pass curves? Pass requires yes to all three.
  2. Inference reforecast. Is the inference line being reforecast monthly, with low/expected/high bands, and is actual usage tracking inside the bands? Pass requires bands and a monthly cadence, not a flat number.
  3. Favor velocity. Is the count of out-of-SOW requests metered, capped, and reported? Sustained velocity above two per week for four weeks without a change-order conversion is a fail.
  4. Next model-upgrade re-eval scheduled. Is the next re-eval on the calendar with allocated engineering weeks and a named owner? “We’ll handle it when it comes up” is a fail.
  5. Post-launch retainer signed. Is the retainer signed before the build wraps, or has it been explicitly waived in writing by the buyer’s executive sponsor? Unsigned and not waived is a fail.

The checklist is binary on purpose. Fuzzy answers in cost governance are how runaways arrive. None of these causes are clever or hard to anticipate; they are the same five, project after project. Projects run away because the budget template was written for a category of software that no longer matches what an AI project is. Replace the template; the runaways stop. The clauses above are the replacement.
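Because every item is pass/fail, the 90-day review reduces to five booleans and a count. A minimal sketch, assuming hypothetical key names for the five checks; the three-failure cutoff is the one stated with the checklist.

```python
CHECKS = (
    "eval_engineering_visible",
    "inference_reforecast_monthly",
    "favor_velocity_governed",
    "next_reeval_scheduled",
    "retainer_signed_or_waived",
)


def score(answers: dict[str, bool]) -> tuple[list[str], bool]:
    """Return (failed checks, running_away) for a 90-day review.

    Every check must be answered; a missing answer is an error, not a pass.
    """
    missing = [c for c in CHECKS if c not in answers]
    if missing:
        raise KeyError(f"unanswered checks: {missing}")
    failed = [c for c in CHECKS if not answers[c]]
    return failed, len(failed) >= 3
```

Refusing to score an incomplete checklist is a deliberate choice in this sketch: an unanswered item is exactly the fuzzy answer the checklist is meant to rule out.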

Frequently asked questions

What is a runaway AI project? A project whose total spend ends up 25 to 60 percent above approved budget without a scope expansion that finance signed off on. The overrun clusters into the five causes above, usually three or more in combination.

What is eval-debt? The unbuilt test infrastructure required to verify an AI system works. When the SOW is priced on features, the agency still has to build evals but cannot bill for a line that does not appear in the contract.

How much can inference variance blow a budget by? 30 to 200 percent on the inference line over a 12-month engagement, driven by workload shifts, model-mix changes, and retrieval expansion.

Why are small favors a leading indicator of runaway? Each falls below the change-order threshold but collectively they consume 15 to 25 percent of project capacity. Favor-velocity above two per week for four weeks is the indicator.

How often do model-upgrade re-evals need to be budgeted? Three to five times per year. Each takes two to four engineering weeks. A 12-month engagement without an allowance lives 8 to 16 weeks under-funded.

What goes wrong when retainers are skipped? The system has no owner once the build team rolls off. Evals stop, regressions go undetected, the first model upgrade lands without re-evaluation, and a new build budget is requested 90 to 180 days later at two to three times the retainer’s cost.

Are these five causes equally likely? No. Eval-debt and small-favor scope creep are most common, in roughly two-thirds of runaways. Inference variance is structurally guaranteed but its severity varies. Model-upgrade re-evals are universal beyond month nine. Skipped retainers are the longest-tail cause, surfacing only after launch.

Can a contract prevent these runaways? Yes. The five clauses collectively prevent 70 to 85 percent of runaways. The rest are buyer-side governance failures no contract fixes unilaterally.

Key takeaways

  • Runaway AI projects are a budgeting-template failure, not an execution failure: the residue of a 2018 budget template meeting a 2026 AI project.
  • The five causes (eval-debt absorbed, inference variance unbudgeted, small-favor scope creep, model-upgrade re-evals missing, retainer skipped) appear in some combination in four of five post-mortemed projects.
  • Each cause has a 90-day leading indicator and a specific contract clause that closes it. The clauses are paste-into-the-SOW concrete.
  • The 90-day checklist is binary. Failing three or more means the project is already running away; the only question is when finance will notice.
  • Replace the template before signing, not the team after the overrun. Pair with the economics manifesto and TCO decomposition for the full Pillar 2 view.

Last Updated: Jun 11, 2026

Arthur Wandzel

SFAI Labs helps companies build AI-powered products that work. We focus on practical solutions, not hype.
