Year two of a production AI system costs 30 to 50 percent less per output than year one, but only if the budget structure forces the savings to land somewhere accountable. The default behavior of an engineering team is to silently absorb cost reduction into capability expansion, which is why most organizations cannot see their compounding return on a P&L line. The savings are real, the drivers are nameable, and each one has a typical capture rate. The work is making the budget treat absorption as a deliberate choice rather than a residual.
This is a spoke under the AI project economics manifesto. The manifesto argues evaluation is the unit of account. Compounding return is what happens to that unit across years two through five when the underlying economics keep moving in the buyer’s favor and the budget structure decides whether the buyer or the engineering team captures the gain.
The 30 to 50 percent claim, decomposed
Across mature 2026 production AI systems, year-two inference cost per output drops 30 to 50 percent against the year-one run rate. Six drivers contribute, and they stack rather than substitute. The table below names each, the typical contribution range, and what unlocks it.
| Driver | Year-2 saving | What unlocks it |
|---|---|---|
| Model-vendor price decay | 12 to 18 percent | Continued use of the same APIs at falling per-token prices |
| Prompt optimization and pruning | 8 to 12 percent | Eval-locked prompt cuts and system-prompt compression |
| Caching strategies | 6 to 10 percent | Prompt caching, retrieval caching, output caching on FAQ traffic |
| Selective distillation | 5 to 12 percent | Smaller-model deployment on high-traffic subworkloads |
| Eval-set stability | 2 to 4 percent | Stabilized eval suite reduces rerun cost |
| Observability ROI | 4 to 8 percent | Calibrated observability collapses repair cycles |
The high end of the range, closer to 50 percent, requires that all six drivers are budgeted for capture. The low end, closer to 30 percent, is what the model-vendor price decay alone delivers if the engineering team takes no further action. The middle is where most organizations land in 2026.
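One way to see how the per-driver ranges in the table relate to the 30 to 50 percent headline is to assume the drivers compound multiplicatively, each one cutting the cost left over by the others. That compounding rule is an assumption of this sketch, not something the table states, and it treats observability savings as a fraction of the same cost base even though they land as engineering hours. The rates come from the table; everything else is illustrative.

```python
from math import prod

# Per-driver year-2 saving ranges (low, high), taken from the table above.
DRIVERS = {
    "model_vendor_price_decay": (0.12, 0.18),
    "prompt_optimization": (0.08, 0.12),
    "caching": (0.06, 0.10),
    "selective_distillation": (0.05, 0.12),
    "eval_set_stability": (0.02, 0.04),
    "observability_roi": (0.04, 0.08),
}


def stacked_saving(rates):
    """Combined saving when each driver cuts the cost left by the others.

    Assumption: savings compound multiplicatively rather than add up.
    """
    remaining = prod(1.0 - r for r in rates)
    return 1.0 - remaining


low = stacked_saving(lo for lo, _ in DRIVERS.values())
high = stacked_saving(hi for _, hi in DRIVERS.values())
print(f"low: {low:.1%}, high: {high:.1%}")  # low: 32.0%, high: 49.5%
```

Under this assumption the all-low and all-high cases land near the two ends of the headline range, whereas a plain sum of the table's ranges would give 37 to 64 percent.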
The claim that needs the most defending is not the magnitude. It is the assertion that these savings can be captured rather than absorbed. The next six sections take each driver in turn, and the section after lays out the budget structure that decides which side of the line the dollar lands on.
Driver 1: model-vendor price decay (12 to 18 percent)
Anthropic, OpenAI, and Google have shipped roughly 35 to 60 percent year-over-year price-per-million-token reductions on flagship-class models since 2023. Each refresh cycle releases a model that is at least equivalent in capability at a meaningfully lower price, plus a smaller model that is comparable in capability to the previous flagship at a much lower price.
A system that does nothing operationally (same prompts, same retrieval, same routing) picks up 12 to 18 percent year-2 savings simply by riding the price curve, assuming the application is not pinned to a specific deprecated model SKU. The capture mechanism is staying current with the latest API surface and naming the vendor’s pricing trajectory in the original budget so finance is not surprised when the saving lands.
The risk: vendor price drops are not uniform. Long-context premium pricing, vision pricing, and tool-use pricing have all moved on different curves than text generation. A budget that assumes one curve across all usage modes will under-forecast on text-dominant workloads and over-forecast on multimodal-heavy ones.
The capture move: name the expected price decay in the year-2 budget as a specific line, not an aspiration. Sample wording: “Inference run rate at year-2 commencement assumed to be 85 percent of year-1 baseline based on vendor pricing trajectory; deviation from this baseline triggers quarterly review.” The line item creates a question the team has to answer rather than a savings stream finance hopes for.
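The sample budget line can be turned into a mechanical check. The sketch below is a minimal illustration: the 85 percent ratio comes from the sample wording, while the function name and the three-point tolerance band are hypothetical choices that a real budget would need to set for itself.

```python
def needs_quarterly_review(year1_run_rate, year2_run_rate,
                           assumed_ratio=0.85, tolerance=0.03):
    """Flag a deviation from the budgeted year-2 / year-1 run-rate ratio.

    assumed_ratio: budgeted ratio from the sample line (85 percent).
    tolerance: hypothetical band; the sample wording does not specify one.
    """
    observed_ratio = year2_run_rate / year1_run_rate
    return abs(observed_ratio - assumed_ratio) > tolerance


# Year 2 is running at 95 percent of the year-1 baseline: review is triggered.
print(needs_quarterly_review(100_000, 95_000))  # True
```

The check fires in both directions on purpose: a run rate well below the baseline is also a question the team has to answer, since it may mean savings are available to capture.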
Driver 2: prompt optimization and pruning (8 to 12 percent)
Year-one prompts are bloated. They accumulate few-shot examples added speculatively, system-prompt clauses appended through ten months of incremental edits, and multi-turn flows that rebuild context redundantly across turns. None of these are mistakes; each addition was earned at the time it was made. By year two, the eval set is stable enough to test which of those additions still earn their token cost.
Three patterns produce the 8 to 12 percent savings.
Few-shot pruning. Walk through every few-shot example in production prompts. For each, ablate it from the prompt and rerun the eval. Examples that do not move the eval score by more than 0.5 percent come out. Typical removal rate: 30 to 50 percent of accumulated few-shot content.
System-prompt compression. Year-one system prompts are often 800 to 2,000 tokens. By year two, structured rewriting against the eval set can compress these by 30 to 60 percent without measurable accuracy loss. The rewrite is engineering work, not a one-line config change, but it lands the savings on every subsequent call.
Multi-turn context discipline. Conversational and agent flows often re-send the full history each turn when only the delta is needed. Restructuring to send minimal turn-relevant context cuts input tokens 40 to 70 percent on multi-turn workloads. This is one of the largest lines on chat-shaped products.
The capture mechanism is an eval-locked threshold. Without the threshold, prompt cuts produce regressions instead of savings, and the engineering team correctly resists the optimization. With the threshold, the cuts are bounded by a measurable test, and the engineering team can ship them confidently. Detail in the hidden cost of AI evals.
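The few-shot pruning pattern, bounded by an eval-locked threshold, can be sketched as a greedy ablation loop. This is a minimal illustration, not a prescribed procedure: `run_eval` is a hypothetical callable standing in for the team's eval harness, and the 0.5 percent threshold is read here as an absolute drop of 0.005 on a 0-to-1 score, which is an assumption.

```python
from typing import Callable, List, Sequence


def prune_few_shots(
    examples: Sequence[str],
    run_eval: Callable[[Sequence[str]], float],
    max_drop: float = 0.005,
) -> List[str]:
    """Remove few-shot examples whose absence costs at most max_drop in score.

    run_eval is a hypothetical hook: it takes the candidate example list and
    returns an eval score in [0, 1] for the prompt built from it.
    """
    kept = list(examples)
    baseline = run_eval(kept)  # locked reference score, measured once
    i = 0
    while i < len(kept):
        candidate = kept[:i] + kept[i + 1:]
        if baseline - run_eval(candidate) <= max_drop:
            kept = candidate  # example does not earn its token cost
        else:
            i += 1  # example is load-bearing, keep it and move on
    return kept
```

Each candidate is compared against the original baseline rather than the previous step, so the cumulative regression across all cuts stays inside the threshold instead of drifting by 0.5 percent per removal.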
Driver 3: caching strategies (6 to 10 percent)
Three caching layers contribute, with different operational profiles.
Prompt caching at the model vendor. Anthropic and OpenAI both expose prompt caching APIs that reduce input-token cost on stable system prompts. Enabling it is nearly free; the operational change is structuring prompts so the cacheable prefix is identifiable and stable. On systems with 1,000-plus token system prompts, this alone delivers 3 to 5 percent of run-rate savings.
Retrieval caching. Embedding and search cost on repeated queries can be cached at the application layer. Typical hit rates on production retrieval-augmented systems run 15 to 30 percent; every cache hit is a saved embedding call plus a saved vector search. Contribution: 1 to 3 percent of run-rate savings.
Output caching for FAQ-shaped traffic. Some workloads are dominated by a small set of repeated user intents. For these, caching the full output (with eval-validated TTL) eliminates the model call entirely on the cached fraction. Contribution: 2 to 4 percent of savings on systems where 10 to 25 percent of traffic is FAQ-shaped. The risk is stale-answer regression on edge cases, which is why this layer requires careful eval gating.
The capture mechanism for all three is making caching a year-2 engineering line item with a measured savings target, rather than a “we should cache more” sentiment. A team that does not budget caching engineering hours will not ship caching at the rate the saving requires.
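The output-caching layer is the easiest of the three to illustrate in isolation. The sketch below is a minimal in-memory version under stated assumptions: the `OutputCache` class, the normalized `intent_key`, and the injectable clock are all hypothetical, and a production system would need its own key normalization plus the eval gating described above. The hit and miss counters are there so the layer can report against a measured savings target.

```python
import time
from typing import Callable, Dict, Tuple


class OutputCache:
    """In-memory output cache with a fixed TTL (illustrative only)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._store: Dict[str, Tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(self, intent_key: str,
                    call_model: Callable[[], str]) -> str:
        """Return a fresh cached answer, or call the model and cache it."""
        now = self._clock()
        entry = self._store.get(intent_key)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        answer = call_model()  # the only path that spends inference
        self._store[intent_key] = (now, answer)
        return answer

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

A caller would wrap the model call in a closure, for example `cache.get_or_call("reset-password", lambda: generate(prompt))`, where `generate` stands in for whatever client the system already uses.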
Driver 4: selective distillation (5 to 12 percent)
Distillation is moving a workload from a flagship model to a smaller, cheaper model that achieves 95 percent of the flagship’s eval score on that specific subworkload. It is not training a new model from scratch; it is identifying where a smaller model already in the vendor catalog (or fine-tuneable) can carry traffic at one-fifth to one-tenth the per-output cost.
The economics work when traffic on a defined subworkload exceeds roughly 100,000 requests per month, and when the eval set is granular enough to validate the smaller model’s accuracy on that subworkload specifically. Below that threshold, the engineering investment in distillation routing exceeds the inference savings.
Typical distillation outcome: a 5 to 12 percent contribution to year-2 savings, with the wide range depending on what fraction of total traffic is amenable to a smaller model. Customer-support and classification-shaped workloads tend to be heavily distillable; agentic and reasoning-shaped workloads tend not to be.
The capture mechanism is an explicit subworkload analysis at the start of year two: for each major usage pattern, what is the smallest model that achieves the locked threshold, and what is the routing complexity to deploy it. The decision to distill should be backed by a one-week analysis, not a quarterly debate.
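The per-subworkload question (smallest model that clears the locked threshold, and only where traffic justifies the routing work) can be sketched as a small selection function. The 95 percent score ratio and the 100,000 requests-per-month floor come from the text above; the `Candidate` type, the use of per-output cost as a proxy for model size, and the model names in the usage example are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str               # hypothetical model identifier
    cost_per_output: float  # illustrative cost, any consistent unit
    eval_score: float       # score on this subworkload's eval set


def pick_model(
    candidates: Sequence[Candidate],
    flagship_score: float,
    monthly_requests: int,
    min_score_ratio: float = 0.95,
    min_monthly_requests: int = 100_000,
) -> Optional[Candidate]:
    """Cheapest candidate that clears the threshold, or None to stay put."""
    if monthly_requests < min_monthly_requests:
        return None  # below this volume routing work outweighs the saving
    floor = min_score_ratio * flagship_score
    passing = [c for c in candidates if c.eval_score >= floor]
    if not passing:
        return None
    return min(passing, key=lambda c: c.cost_per_output)


choice = pick_model(
    [Candidate("small", 0.001, 0.80), Candidate("mid", 0.003, 0.88)],
    flagship_score=0.90,
    monthly_requests=250_000,
)
print(choice.name if choice else "stay on flagship")  # mid
```

Running this once per major usage pattern yields the routing table for the one-week analysis; routing complexity still has to be judged separately, since the function only answers the accuracy and volume questions.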
Driver 5: eval-set stability (2 to 4 percent)
Eval reruns are a real cost line in year one: every prompt change, retrieval tweak, and model upgrade triggers a full eval suite that consumes inference and engineer time. The hidden cost of evals across the project lifecycle is detailed in why your AI project budget should have a model deprecation reserve.
By year two the eval set has stabilized, the threshold is locked, and reruns become rarer and cheaper. The savings here are modest in absolute terms, 2 to 4 percent, but they sit inside a larger structural improvement: engineers stop debating eval methodology and start debating system performance. The qualitative gain is larger than the cost line suggests.
The capture mechanism is treating eval-set version control as a discipline. Each major eval-set change is logged, justified, and dated. Year-2 budgets assume an eval-set rebuild rate roughly 60 percent lower than year one. If the rebuild rate exceeds that ceiling, the eval set is not stabilizing and a deeper question is open.
Driver 6: observability ROI (4 to 8 percent)
Year-1 observability produces noise. The instrumentation is being calibrated, the dashboards are being built, the alert thresholds are being tuned. Repair cycles in this period are still triggered primarily by customer reports, and observability is a confirmation tool rather than a detection tool.
Year-2 observability produces signal. Regressions are detected before customers report them, the on-call rotation has institutional memory of which alerts matter, and catching issues through observability rather than customer reports cuts repair time by a factor of three to six. The 4 to 8 percent savings shows up not as inference cost reduction but as engineering-hour avoidance: sprints that would have been spent on reactive repair are spent on planned capability instead.
The capture mechanism is naming observability as COGS rather than OpEx in the budget structure. The economics manifesto’s principle four argues this directly: observability sized at 15 to 25 percent of inference spend produces a system whose health is visible on the same dashboard as its unit economics, and whose year-2 repair velocity reflects that visibility.
Capture vs absorption: the budget structure that decides
A dollar of inference saved becomes a dollar of additional complexity unless the budget structure forces it to remain a savings line. The default behavior of an engineering team, correctly, given how their incentives are usually structured, is to spend savings on capability expansion: more retrieval, more tool calls, larger context windows, deeper reasoning chains. Each capability addition is defensible on its own terms; collectively they absorb the entire compounding return.
Three structural moves convert absorption into capture.
Move one: set a year-2 inference cost target as a contracted line item. Not an aspiration, not a stretch goal, a contracted target. The target should reference the year-1 baseline and the expected savings drivers explicitly. Sample: “Year-2 inference cost per active user at 60 percent of year-1 baseline, decomposed across the six savings drivers in attached schedule.”
Move two: require capability expansion to be funded separately. Year-2 capability work has its own budget line. It is not silently financed by inference savings. If the engineering team wants to deploy a more sophisticated retrieval layer, it must be approved as a capability investment with a separately-articulated ROI, not as a “we found efficiency” residual.
Move three: run a quarterly cost-decomposition review. Each quarter, name which of the six savings drivers landed against forecast and which did not. Drivers under-performing forecast trigger an investigation; drivers over-performing trigger a budget revision rather than silent absorption. This converts the savings stream from a hope into a measured outcome.
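The quarterly review in move three reduces to a per-driver comparison of forecast against realized savings. The sketch below is one hedged way to encode it: the function name, the action labels, and the one-point tolerance band are hypothetical, and the figures in the usage example are made up for illustration rather than drawn from any real system.

```python
from typing import Dict


def review_drivers(forecast: Dict[str, float],
                   actual: Dict[str, float],
                   tolerance: float = 0.01) -> Dict[str, str]:
    """Map each savings driver to the action the quarterly review calls for.

    forecast / actual: saving as a fraction of the year-1 baseline.
    tolerance: hypothetical band inside which a driver counts as on track.
    """
    actions = {}
    for driver, expected in forecast.items():
        delta = actual.get(driver, 0.0) - expected
        if delta < -tolerance:
            actions[driver] = "investigate"    # under-performing forecast
        elif delta > tolerance:
            actions[driver] = "revise_budget"  # over-performing: re-baseline
        else:
            actions[driver] = "on_track"
    return actions


# Illustrative numbers only.
result = review_drivers(
    forecast={"price_decay": 0.15, "caching": 0.08, "distillation": 0.08},
    actual={"price_decay": 0.155, "caching": 0.03, "distillation": 0.12},
)
print(result)
```

A driver missing from `actual` is treated as zero realized saving, so an unmeasured driver surfaces as an investigation item rather than passing silently.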
A buyer who runs all three moves captures most of the compounding return on the P&L. A buyer who runs none of them watches the savings disappear into capability complexity that no one explicitly approved. The math is the same; the budget structure is what decides who gets the dollar.
Frequently asked questions
Does the 30 to 50 percent range hold for retrieval-heavy systems specifically? The range holds, but the driver mix shifts. Retrieval-heavy systems get more contribution from caching (toward the upper end) and less from distillation (since retrieval dominates cost more than generation).
How much of the saving is real cash savings versus avoided cost growth? About 60 to 70 percent is real cash savings against the year-1 baseline. The remainder is avoided cost growth: usage volume often grows year-over-year, and the savings drivers prevent that growth from compounding into cost.
Does the model improve year-2 ROI on agentic systems with tool use? Yes, but the curve is rougher. Agent workloads have more variance per request and harder eval surfaces. Year-2 savings on agents land in the 22 to 38 percent range typically, lower than chat-shaped workloads.
What happens to the savings curve when a major model deprecation hits in year 2? Deprecation events temporarily reverse the savings trajectory. The migration window costs eval reruns, prompt re-tuning, and routing changes that consume saved budget. Detail in the model deprecation reserve article linked above.
Is it ever worth running a system on a deliberately older model to avoid year-2 migration cost? Rarely. The savings from running on the latest API surface usually exceed the migration cost over a 24-month horizon. Pinning to old models is a defensive move that costs more than it protects against.
Should the year-2 savings forecast be presented to the board as a single number? No. The decomposition is the point. A board that sees “30 to 50 percent” as a number cannot govern it; a board that sees the six drivers and their capture rates can ask the right quarterly questions.
How does this interact with the AI project Rule of 40? Compounding return on the cost side is what makes adjusted Rule of 40 achievable for AI-native SaaS. Without year-2 cost reduction, inference COGS as percent of revenue grows with the customer base and Rule of 40 becomes structurally harder to clear.
What if the engineering team disagrees that capability expansion should be funded separately? The disagreement usually surfaces a deeper one about who owns the year-2 economics. Resolving it at the finance/engineering interface is the work; absorbing the savings in the meantime is the cost of not resolving it.
Does the model apply to fine-tuned smaller models the team trained themselves? Mostly yes, but the price-decay driver is replaced with a hosting-cost decay driver. Self-hosted inference cost on commodity GPUs has fallen at a rate comparable to vendor pricing through 2025 and 2026.
Key takeaways
- Year-two AI inference cost drops 30 to 50 percent against year-one baseline across mature production systems.
- Six drivers stack: model-vendor price decay (12 to 18 percent), prompt optimization (8 to 12 percent), caching (6 to 10 percent), distillation (5 to 12 percent), eval-set stability (2 to 4 percent), observability ROI (4 to 8 percent).
- Capture requires explicit governance. Default behavior absorbs savings into capability expansion.
- The budget structure that produces capture: contracted year-2 cost target, separately-funded capability work, and quarterly cost-decomposition reviews.
- Each driver has a specific unlock. Driver one needs no operational change; driver four needs a one-week subworkload analysis. The unlocks are nameable, not vague.
- Distillation pays off above roughly 100,000 requests per month per subworkload; below that the engineering investment exceeds the saving.
- Observability ROI shows up as engineering-hour avoidance rather than inference cost reduction. Name it as COGS to make it visible.
- Compounding return on cost is the structural reason AI-native SaaS Rule of 40 math works at scale; without it, inference COGS dominates growth.
Arthur Wandzel