Enterprise Software 13 min read

The AI Project Budget Anti-Patterns We See Across 60 Engagements

Across AI development engagements from 2024 to 2026, seven budget anti-patterns recur with enough regularity that they read like a checklist. None of them is unique to a particular agency or a particular buyer. Each is a structural failure mode inherited from a 2018-era software contracting template that had no categories for the work AI projects require. Each kills budgets in a recognizable shape, and each has a structural prevention that costs less to put in place at kickoff than to retrofit during a crisis.

This is a spoke under the AI project economics manifesto. The manifesto names eight principles a 2026 finance org must internalize to budget AI projects without quarterly surprises. The anti-patterns below are what happens when those principles are absent, named in the negative for buyers and agencies that want to recognize the patterns in their own contracts.

A note on framing: the patterns below are illustrative composites observed across the engagement set. No specific client names or dollar amounts are reported; the patterns are the point, not the cases.

Anti-pattern 1: budgeting features instead of evals

Failure shape. The project budget reads as a feature list with hours estimated against each. No line item references an eval set or an eval threshold. Engineering work that does not fit the feature taxonomy (eval set construction, regression triage, model-upgrade re-evaluation) has no home in the budget and gets done as silent overhead, then surfaces as scope creep between months four and eight.

How it kills budget. The omission averages 30 to 40 percent of total project cost. Work that was always going to happen does not appear in the original budget, then appears as overrun against a budget that was structurally incomplete. The conversation becomes adversarial because the agency cannot show the work was scoped and the buyer cannot show the budget was wrong from the start.

Prevention. Every line item in the budget references an eval set by name and version. “Build the agent” becomes “agent passing eval-set v1 at >= 0.82 weighted score on the 240-prompt enterprise test set.” Eval engineering is named as a category and sized at 30 to 40 percent of total project cost. Detail in the hidden cost of AI evals.
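The prevention above can be checked mechanically. What follows is a minimal sketch, not a prescribed tool: the `BudgetLine` structure, its field names, and the dollar figures are all hypothetical, chosen only to illustrate a budget where one line lacks an eval reference and eval engineering lands inside the 30 to 40 percent band.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BudgetLine:
    """One budget line item (hypothetical structure for illustration)."""
    name: str
    cost: int                          # whole currency units
    eval_set: Optional[str] = None     # eval set name and version, e.g. "eval-set v1"
    threshold: Optional[float] = None  # minimum weighted score required to pass
    is_eval_engineering: bool = False  # True for the named eval-engineering category


def lines_missing_eval(lines: List[BudgetLine]) -> List[str]:
    """Return names of line items that reference no eval set or no threshold."""
    return [l.name for l in lines if l.eval_set is None or l.threshold is None]


def eval_engineering_share(lines: List[BudgetLine]) -> float:
    """Fraction of total budget allocated to eval engineering."""
    total = sum(l.cost for l in lines)
    eval_cost = sum(l.cost for l in lines if l.is_eval_engineering)
    return eval_cost / total if total else 0.0


# Illustrative figures only; not taken from any engagement.
budget = [
    BudgetLine("Agent build", 400_000, "eval-set v1", 0.82),
    BudgetLine("Integration", 250_000),  # no eval reference: flagged below
    BudgetLine("Eval engineering", 350_000, "eval-set v1", 0.82,
               is_eval_engineering=True),
]

missing = lines_missing_eval(budget)
share = eval_engineering_share(budget)
```

Run against a draft budget at kickoff, a non-empty `missing` list or a `share` well below 0.30 would be the signal that the budget is still organized around features rather than evals.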

Anti-pattern 2: no inference reserve

Failure shape. The project budget includes an inference cost forecast based on kickoff assumptions about usage shape, prompt complexity, and retrieval depth. No reserve is sized against forecast variance. The first month’s inference invoices come in at 1.5 to 3x forecast, sometimes higher on long-context or agent-heavy workloads. Finance asks why; engineering explains that the real workload differs from the kickoff assumption. The conversation pivots from delivery to crisis.

How it kills budget. Inference cost forecasting at kickoff is a guess against an unknown distribution. Without a reserve, the variance is visible as budget overrun. With a reserve sized 15 to 30 percent above the expected inference cost run rate, the variance is absorbed inside the named line. The work continues; the conversation stays about delivery.

Prevention. Inference reserve as a contracted budget line, sized 15 to 30 percent above expected run rate. Quarterly true-up against actual usage with the reserve narrowing as forecasting accuracy improves. After year one, the reserve typically contracts to 10 to 15 percent because actual usage shape is known.
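The reserve and true-up arithmetic can be sketched as follows. This is an illustrative model only: the function names and the monthly figures are hypothetical, and the true-up here simply compares a quarter of actual invoices against forecast plus reserve.

```python
from typing import Dict, List


def inference_reserve(expected_monthly_cost: int, reserve_pct: int) -> int:
    """Monthly reserve sized as a percentage above the expected run rate."""
    if not 0 <= reserve_pct <= 100:
        raise ValueError("reserve_pct must be between 0 and 100")
    return expected_monthly_cost * reserve_pct // 100


def quarterly_true_up(expected_monthly_cost: int, reserve_pct: int,
                      actual_monthly_costs: List[int]) -> Dict[str, int]:
    """Compare actual inference invoices against forecast plus reserve."""
    months = len(actual_monthly_costs)
    forecast = expected_monthly_cost * months
    reserve = inference_reserve(expected_monthly_cost, reserve_pct) * months
    variance = max(sum(actual_monthly_costs) - forecast, 0)
    absorbed = min(variance, reserve)  # variance covered by the named line
    return {
        "forecast": forecast,
        "reserve": reserve,
        "variance": variance,
        "absorbed": absorbed,
        "overrun": variance - absorbed,  # what surfaces as a budget surprise
    }


# Hypothetical quarter: 10,000 expected per month, invoices run above forecast.
actuals = [12_000, 13_000, 11_500]
with_30 = quarterly_true_up(10_000, 30, actuals)
with_15 = quarterly_true_up(10_000, 15, actuals)
```

In this made-up quarter the 30 percent reserve absorbs the full 6,500 variance, while a 15 percent reserve leaves 2,000 visible as overrun. Narrowing `reserve_pct` at each true-up mirrors the contraction toward 10 to 15 percent after year one.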

Anti-pattern 3: milestone payments without eval gates

Failure shape. Contract structures milestone payments against feature acceptance: when the agent is built, payment one. When integration is complete, payment two. When the production deployment runs, payment three. No milestone references an eval threshold. The system can pass milestones while degrading on the actual workload.

How it kills budget. Milestone payments without eval gates pay against shipping, not against working. The agency ships features, the buyer pays, and the eval bar quietly slips. Two quarters in, the buyer realizes the cumulative delivery does not perform, but the milestone payments are gone and the engagement is past the renegotiation window.

Prevention. Each milestone names a specific threshold on a specific eval set version. Pass is the eval report; fail triggers structured remediation, not change orders. Roughly 30 percent of contract value held back against eval-threshold milestones; high-stakes engagements run 40 percent. Less than 20 percent and the holdback is ceremonial. Across the engagement pattern, this is the single strongest predictor of engagement health.
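A minimal sketch of an eval-gated milestone payment follows. It assumes, as a simplification not stated above, that the non-held portion is paid on delivery and the holdback is paid only when the eval report clears the threshold. The function name and the figures are hypothetical.

```python
from typing import Dict, Union


def milestone_payment(milestone_value: int, holdback_pct: int,
                      eval_score: float,
                      threshold: float) -> Dict[str, Union[bool, int]]:
    """Split a milestone payment into paid and held amounts based on an eval gate."""
    if not 0 <= holdback_pct <= 100:
        raise ValueError("holdback_pct must be between 0 and 100")
    holdback = milestone_value * holdback_pct // 100
    passed = eval_score >= threshold
    # Assumption: a failed gate withholds the holdback pending remediation.
    paid = milestone_value if passed else milestone_value - holdback
    return {"passed": passed, "paid": paid, "held": milestone_value - paid}


# Hypothetical milestone worth 100,000 with a 30 percent holdback,
# gated on a 0.82 weighted score.
cleared = milestone_payment(100_000, 30, eval_score=0.85, threshold=0.82)
missed = milestone_payment(100_000, 30, eval_score=0.79, threshold=0.82)
```

A missed gate keeps 30,000 held until structured remediation brings the score over the threshold, which is the alignment mechanism the holdback exists to provide.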

Anti-pattern 4: no kill clause budget impact

Failure shape. Contract has no termination right at defined gates, or has a termination right that is not modeled in the budget. Engagements that are not converging cannot be exited cleanly. The buyer either runs the engagement to its full term and accepts the cost, or terminates messily and accepts the legal risk.

How it kills budget. A kill clause’s budget impact is asymmetric: exercising it limits downside exposure to the spent fraction rather than the contracted total. Engagements without a budget-modeled kill clause forecast against full delivery as if no termination is possible, which inflates apparent commitment and removes the option value of bailing on a project that is not converging. The cost is the trapped capital in a non-converging project that should have been redirected.

Prevention. Contract a kill clause with structured triggers: failure to clear an eval threshold by milestone N, failure to deliver an architecture review at gate M, or failure to produce an observability dashboard by week K. Model the kill clause in the budget; present-value the option of termination at each gate. Engagements with kill clauses tend to converge because the alignment is visible to both sides.
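The asymmetry can be made concrete with a small exposure table. This sketch models only spent versus released capital at each gate. It does not discount or probability-weight the termination option, which a fuller present-value model would. The per-gate spend figures are hypothetical.

```python
from itertools import accumulate
from typing import List, Tuple


def kill_clause_exposure(gate_spend: List[int]) -> List[Tuple[int, int, int]]:
    """For each gate, return (gate number, spent so far, capital released on exit).

    gate_spend holds the planned spend in each period ending at a gate.
    """
    total = sum(gate_spend)
    return [
        (gate, spent, total - spent)
        for gate, spent in enumerate(accumulate(gate_spend), start=1)
    ]


# Hypothetical engagement: 1,000,000 contracted across three gates.
exposure = kill_clause_exposure([200_000, 300_000, 500_000])
```

Exiting at gate 1 in this example caps exposure at 200,000 and releases 800,000 for redirection. Without the clause, the budget carries the full 1,000,000 as committed.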

Anti-pattern 5: no model-deprecation reserve

Failure shape. The engagement runs more than 12 months. The underlying model gets deprecated at month 9 or 14. Migration cost (eval reruns, prompt re-tuning, routing changes) needs to land in a budget that has no line for it. Either the running budget gets cut to fund the migration (cutting capability work), or the migration cost surfaces as scope creep.

How it kills budget. Major model providers deprecate models on roughly 12 to 18 month cycles. Any engagement spanning more than a year is structurally exposed. Without a reserve, the forced choice damages either delivery (cut capability) or trust (carry as overrun). Either path compromises engagement trajectory.

Prevention. Model-deprecation reserve sized at 5 to 15 percent of total project cost, named in the budget and presented as a contingency line. Detail in why your AI project budget should have a model-deprecation reserve. The reserve releases at engagement close if no deprecation event occurred; buyers should treat it as insurance, not slush.
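The sizing band and the release at close reduce to simple arithmetic, sketched below. The function names are hypothetical, and the handling of a partially spent reserve (releasing whatever remains) is an assumption; the text above only covers the case where no deprecation event occurred.

```python
from typing import Tuple


def deprecation_reserve_band(total_project_cost: int,
                             low_pct: int = 5,
                             high_pct: int = 15) -> Tuple[int, int]:
    """Return the (low, high) reserve band for a given total project cost."""
    return (total_project_cost * low_pct // 100,
            total_project_cost * high_pct // 100)


def released_at_close(reserve: int, migration_spend: int) -> int:
    """Reserve returned to the buyer at engagement close.

    Assumption: any unspent remainder is released; spend above the
    reserve releases nothing.
    """
    return max(reserve - migration_spend, 0)


# Hypothetical 1,000,000 project.
band = deprecation_reserve_band(1_000_000)
no_event = released_at_close(100_000, migration_spend=0)
big_migration = released_at_close(100_000, migration_spend=120_000)
```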

Anti-pattern 6: fixed-price for variable work

Failure shape. The contract uses a fixed-price structure for an engagement whose scope is genuinely variable: model choice, retrieval design, eval bar, latency budget, and failure-mode taxonomy all shift as the project meets real workload. The agency loses money on the work done; the buyer loses time on the work the agency declines.

How it kills budget. Fixed-price pushes the agency to defend the original scope against discovered reality, which converts scope discovery into scope-creep disputes. The relationship turns adversarial because both sides are economically right within their position and structurally wrong against the work. Engagements that begin fixed-price often end with a renegotiation that costs both sides more than a different structure would have at kickoff.

Prevention. Eval-threshold pricing for the variable portion: a fixed price covers the architecture and infrastructure setup, eval-threshold milestones cover the evaluation and capability work, and a retainer covers post-launch operation. The architecture detailed in the decline of the fixed-price AI project is structurally compatible with how AI work progresses.

Anti-pattern 7: observability bucketed as OpEx

Failure shape. The project budget treats observability as a small OpEx line, a few percent of total cost. The observability stack is built thin: a few dashboards, basic logging, alert thresholds set during early operation. When production starts, the observability stack does not detect regressions; customers do.

How it kills budget. Under-instrumented systems pay a multiplier on every other cost line. Regressions discovered through customer reports cost 2 to 4x more to repair than regressions caught by observability. Eval debt, prompt drift, and model-version sprawl all become invisible until customers report failures. The thin observability budget produces a thick repair-cost line.

Prevention. Observability sized as COGS at 15 to 25 percent of inference spend, with a one-time build line for instrumentation, plus an ongoing line for tuning and synthetic regression injection. The economics manifesto’s principle four argues this directly. The investment shows up not as inference cost reduction but as engineering-hour avoidance: sprints that would have been spent on reactive repair are spent on planned capability instead.
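A rough sketch of the trade-off: size the observability line from inference spend, then compare repair cost under thin and well-instrumented detection. The regression counts, unit repair cost, and detection splits are hypothetical inputs. Only the 15 to 25 percent band and the 2 to 4x multiplier come from the text above.

```python
from typing import Tuple


def observability_band(annual_inference_spend: int) -> Tuple[int, int]:
    """Observability line sized at 15 to 25 percent of inference spend."""
    return (annual_inference_spend * 15 // 100,
            annual_inference_spend * 25 // 100)


def repair_cost(regressions: int, caught_by_observability: int,
                unit_cost: int, customer_multiplier: int) -> int:
    """Total repair cost when missed regressions surface via customer reports."""
    if not 0 <= caught_by_observability <= regressions:
        raise ValueError("caught_by_observability must be within 0..regressions")
    missed = regressions - caught_by_observability
    return (caught_by_observability * unit_cost
            + missed * unit_cost * customer_multiplier)


# Hypothetical year: 10 regressions, 5,000 to repair when caught early,
# 3x that (inside the 2 to 4x range) when a customer finds it first.
band = observability_band(400_000)
thin_stack = repair_cost(10, caught_by_observability=2,
                         unit_cost=5_000, customer_multiplier=3)
instrumented = repair_cost(10, caught_by_observability=9,
                           unit_cost=5_000, customer_multiplier=3)
```

Under these invented inputs the thin stack costs 130,000 in repairs against 60,000 for the instrumented one. The difference is the engineering-hour avoidance the observability line is buying.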

The order to fix them

For an existing engagement carrying multiple anti-patterns, the prioritized fix order:

| Order | Move | Why first |
| --- | --- | --- |
| 1 | Add eval gates to remaining milestone payments | Restores alignment mechanism for remaining term |
| 2 | Establish an inference reserve | Stops monthly budget surprise from killing trust |
| 3 | Name the model-deprecation reserve | Protects against the next forced revalidation |
| 4 | Add or model a kill clause | Recovers option value if engagement remains non-converging |
| 5 | Right-size observability as COGS | Cuts repair-cost multiplier going forward |
| 6 | Convert variable work to eval-threshold pricing | Removes the structural adversarial dynamic |
| 7 | Re-categorize budget around eval engineering | Largest change, requires renegotiation, do last |
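For an engagement that carries only some of the anti-patterns, the table can be applied as a filter that preserves priority. A minimal sketch, assuming the anti-patterns present have already been identified by name; the `FIX_ORDER` list and `fix_plan` function are illustrative names, not part of any existing tool.

```python
from typing import Iterable, List

# (anti-pattern, corrective move) pairs in the prioritized order of the table.
FIX_ORDER = [
    ("milestone payments without eval gates",
     "Add eval gates to remaining milestone payments"),
    ("no inference reserve", "Establish an inference reserve"),
    ("no model-deprecation reserve", "Name the model-deprecation reserve"),
    ("no kill clause budget impact", "Add or model a kill clause"),
    ("observability bucketed as OpEx", "Right-size observability as COGS"),
    ("fixed-price for variable work",
     "Convert variable work to eval-threshold pricing"),
    ("budgeting features instead of evals",
     "Re-categorize budget around eval engineering"),
]


def fix_plan(present: Iterable[str]) -> List[str]:
    """Return the moves for the anti-patterns present, in priority order."""
    found = set(present)
    return [move for pattern, move in FIX_ORDER if pattern in found]


# Hypothetical engagement carrying three of the seven anti-patterns.
plan = fix_plan({
    "fixed-price for variable work",
    "no inference reserve",
    "milestone payments without eval gates",
})
```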

Move one alone changes engagement trajectory measurably. Moves one through four together convert a debt-loaded engagement into one with structural alignment for the remaining term. Moves five through seven are larger renegotiations that require a different posture between buyer and agency, but they pay back over the rest of the engagement and any subsequent ones with the same partner.

For a new engagement, run all seven moves at kickoff. The cost of putting them in place at the start is materially lower than the cost of retrofitting them under crisis.

Frequently asked questions

Are these anti-patterns specific to external agency engagements? No. Internal AI teams paying out of opex face the same patterns, just less visibly. Internal “milestone” maps to capacity allocation; internal “fixed-price” maps to fixed annual budget. The patterns are economic, not contractual.

How does an agency that proposes the prevention structure differentiate from one that does not? Buyers should treat agencies that propose eval-threshold milestones, inference reserves, and model-deprecation reserves as more economically sophisticated than agencies that do not. The prevention structure costs the agency more to manage but produces healthier engagements.

Should buyers walk away from agencies that resist eval-threshold milestones? Resistance signals one of two things: the agency does not know how to operate against eval thresholds, or the agency knows it cannot clear them. Either is a reason to look elsewhere. Detail in a field guide to evaluating an AI agency.

How does the inference reserve reconcile with fixed-price contracts that include inference? It does not reconcile; the two are structurally incompatible. A fixed-price contract that includes inference puts the agency on the wrong side of usage variance. Inference should be passed through at cost or invoiced separately under a budget line the buyer controls.

What is the right way to size the eval-engineering line at kickoff before the workload is known? Use 30 to 40 percent of total project cost as the planning band. After the first eval-pass cycle, the band narrows to a more specific number based on observed eval complexity. Treating the band as plan rather than forecast is the right posture.

Do these anti-patterns apply to research-shaped AI work? Partially. Research has its own anti-patterns (output-bound budgeting where the output may not exist, no kill clause for hypothesis failure). But the inference reserve, model-deprecation reserve, and observability-as-COGS patterns translate cleanly.

What about hackathon-style proof-of-concept budgets? The patterns matter less for two-week PoC work where the budget is small enough that variance is absorbed in scope. They matter critically for any engagement that translates a PoC into production, because production carries the variable cost shape the PoC budget did not.

Should procurement teams own these anti-patterns or should engineering leadership? Both. Procurement owns the contract structure (kill clauses, milestone payment terms, fixed-price vs eval-threshold). Engineering leadership owns the technical line items (eval engineering, inference reserve, observability sizing). Cross-functional alignment is what produces a budget that survives contact with reality.

How long does it take to renegotiate an existing engagement around these patterns? A focused renegotiation runs four to eight weeks. Faster is rare; slower indicates the relationship is not aligned for the conversation. Engagements that cannot complete the renegotiation in eight weeks have a deeper alignment problem the contract cannot fix.

Key takeaways

  • Seven AI budget anti-patterns recur with structural regularity: feature-not-eval line items, no inference reserve, milestone payments without eval gates, no kill clause budget impact, no model-deprecation reserve, fixed-price for variable work, observability bucketed as OpEx.
  • Each anti-pattern has a measurable cost and a structural prevention that is cheaper at kickoff than under crisis.
  • Across the engagement pattern, milestone payments without eval gates is the single strongest predictor of engagement failure.
  • Inference reserve sized 15 to 30 percent above forecast absorbs variance that otherwise pivots the conversation from delivery to crisis.
  • Model-deprecation reserve at 5 to 15 percent of total cost protects engagements running longer than 12 months.
  • Eval-threshold pricing replaces fixed-price for variable work and removes the structural adversarial dynamic.
  • Observability sized as COGS at 15 to 25 percent of inference spend cuts the regression-repair multiplier on every other cost line.
  • For existing engagements, fix in order: eval gates, inference reserve, deprecation reserve, kill clause, observability, eval-threshold pricing, full eval-engineering recategorization. Move one alone changes trajectory.

Last Updated: May 9, 2026

Arthur Wandzel

SFAI Labs helps companies build AI-powered products that work. We focus on practical solutions, not hype.
