Most AI projects still buy inference at retail. They pay on-demand pricing on every token, every action, every day, and accept whatever rate the vendor publishes that month. That is the right strategy for projects under roughly $40K per month of inference spend; above that, it leaves 30 to 50 percent on the table that reserved capacity, provisioned throughput, or committed-use discounts would have captured. This piece walks through the breakeven math, names the four contract patterns worth knowing in 2026, and describes the failure modes that erase the savings if procurement is not paying attention.
The argument sits inside the AI project economics manifesto: inference is a pass-through line, but pass-through does not mean unmanaged. The buyer’s job is to make sure the agency or in-house team is buying inference at the rate the workload deserves.
What reserved AI capacity is in 2026
Reserved AI capacity is the inference-layer analog of reserved cloud compute. Instead of paying per-token at retail, the buyer commits to a minimum throughput, dollar volume, or token volume over a defined term, and receives a discounted rate plus capacity guarantees. In 2026 the major vendors all offer some version: Anthropic’s priority capacity contracts, OpenAI’s committed-volume and provisioned-throughput tiers, Google’s committed-use and reserved-throughput options on Vertex, AWS Bedrock’s provisioned throughput, and Azure OpenAI’s PTUs.
Three things changed between 2023 and 2026 that make this matter more than it used to. First, AI workloads moved from spiky prototypes to predictable production volume; the volatility that made reservation risky has compressed. Second, the discount steepened. Frontier vendors now offer 25 to 50 percent off on-demand for serious commits, where 2023 discounts capped around 15 percent. Third, capacity is no longer always available on demand at frontier-grade tiers; reservations buy guaranteed throughput, not just a discount.
The buyer-side question is not “should we reserve?” The question is “at what threshold of monthly inference spend, and on which workload, does reserving cross over from theoretical savings into realized savings?” The breakeven math is below.
The four contract patterns to know
Reserved AI capacity is not a single product. The four patterns below are the structurally distinct shapes you will encounter; they price differently and fail differently.
1. Committed-use discount (CUD). The buyer commits to a minimum monthly dollar spend over 1 to 3 years. The vendor discounts the published per-token rate by a fixed percentage. Overage is billed at on-demand. Underuse is billed at the commit floor. This is the simplest pattern and the one that fails most often when usage is volatile.
2. Provisioned throughput / dedicated capacity (PTU). The buyer reserves a specified throughput (tokens per minute, requests per second) on dedicated infrastructure. The pricing is by reserved capacity, not by token. Overage hits a ceiling and may queue. Underuse is paid for regardless. This is the right pattern for high-volume, latency-sensitive workloads.
3. Volume tier with rate cards. The buyer enters a tier based on negotiated annual volume; the rate card adjusts down at each tier breakpoint. There is no commit floor in some implementations. This is functionally a discount-by-volume on on-demand and is the gentlest pattern, but the savings are smaller.
4. Multi-model marketplace commits. The buyer commits to a vendor (AWS, Azure, GCP) for total inference dollars across a model marketplace and receives a discount that is portable across the vendor’s hosted models. This is the right pattern for buyers running model-routing strategies where the optimal model for a given request changes over time.
The right pattern depends on workload predictability, latency sensitivity, and routing strategy. The pattern almost always determines the realized savings more than the headline discount rate.
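To make the billing mechanics of the first two patterns concrete, here is a minimal sketch. The function names, the dollar figures, and the assumption that committed-use overage bills at exactly list price are illustrative only; they are not any vendor's actual terms.

```python
def cud_monthly_bill(usage_at_list: float, commit_floor: float, discount: float) -> float:
    """Committed-use discount: pay at least the floor, overage at list price.

    usage_at_list: the month's usage valued at on-demand list price.
    commit_floor:  minimum monthly bill the buyer committed to.
    discount:      fractional discount on list price (0.30 means 30 percent off).
    """
    # List-price value of the usage that the floor covers at the discounted rate.
    covered_at_list = commit_floor / (1 - discount)
    if usage_at_list <= covered_at_list:
        return commit_floor  # underuse is still billed at the commit floor
    # Assumption: usage beyond the commit bills at plain on-demand list price.
    return commit_floor + (usage_at_list - covered_at_list)


def ptu_month(demand_tpm: float, reserved_tpm: float, monthly_price: float) -> tuple[float, float]:
    """Provisioned throughput: the bill is fixed by reserved capacity, not by tokens.

    Returns (bill, excess_tpm), where excess_tpm is the peak demand above the
    reserved ceiling that would queue or be throttled.
    """
    excess_tpm = max(0.0, demand_tpm - reserved_tpm)
    return monthly_price, excess_tpm


if __name__ == "__main__":
    # Hypothetical numbers: a $35K floor at a 30 percent discount.
    for usage in (30_000, 40_000, 60_000):
        bill = cud_monthly_bill(usage, commit_floor=35_000, discount=0.30)
        print(f"list-price usage ${usage:,}: CUD bill ${bill:,.0f}")
    print(ptu_month(demand_tpm=1_200_000, reserved_tpm=1_000_000, monthly_price=50_000))
```

The two functions fail in different ways, which is the point of separating the patterns: the committed-use bill never drops below the floor, while the provisioned-throughput bill never rises but leaves excess demand unserved.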
The breakeven math
The breakeven is determined by three numbers: monthly inference spend at on-demand rates ($M), the discount on the reserved tier (D), and the utilization rate of the reserved capacity (U). Reserved wins when monthly spend at the reserved rate plus underutilization carry beats monthly spend at on-demand.
The simple form: size the reservation so the workload uses a fraction U of it. The reserved bill is then M × (1 − D) for the capacity actually used plus M × ((1 − U) ÷ U) × (1 − D) for the idle capacity, which sums to M × (1 − D) ÷ U. Reserve when that total is less than M, which reduces to U greater than 1 − D. At a 30 percent discount, the breakeven utilization is roughly 70 percent. At a 40 percent discount, it drops to roughly 60 percent. At a 50 percent discount, roughly 50 percent.
Translation: a reasonable rule of thumb is that the workload needs to use at least 60 to 70 percent of the reserved capacity, sustained, for the commit to pay back. Below that, the underuse penalty erases the discount and you are paying more than on-demand.
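One reading consistent with the breakeven figures above (70 percent utilization at a 30 percent discount, 50 percent at a 50 percent discount) is that the full reservation is paid at the discounted rate whether or not it is used. Under that assumption, the sketch below computes the reserved bill and the breakeven utilization; the function names are hypothetical.

```python
def reserved_cost(on_demand_spend: float, discount: float, utilization: float) -> float:
    """Monthly bill for a reservation of which only `utilization` is actually used.

    on_demand_spend: what the month's real usage would cost at list price (M).
    discount:        fractional discount on the reserved tier (D).
    utilization:     used share of the reserved capacity (U), in (0, 1].
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # The reservation must be M / U at list price for usage M to fill a share U of it.
    reserved_capacity_at_list = on_demand_spend / utilization
    return reserved_capacity_at_list * (1 - discount)


def breakeven_utilization(discount: float) -> float:
    """Utilization at which reserved_cost equals on-demand spend: U = 1 - D."""
    return 1 - discount


if __name__ == "__main__":
    spend = 100_000  # hypothetical monthly on-demand spend
    for d in (0.30, 0.40, 0.50):
        print(f"discount {d:.0%}: breakeven utilization {breakeven_utilization(d):.0%}")
    for u in (0.5, 0.6, 0.7, 0.8, 0.9):
        print(f"utilization {u:.0%}: reserved bill ${reserved_cost(spend, 0.30, u):,.0f}")
```

Running the second loop shows the reserved bill crossing the $100K on-demand line as utilization rises, which is the crossover the rule of thumb describes.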
The volume threshold at which reservation starts to pay back as a strategy, not just as a single contract, sits at around $40K per month of frontier-model inference spend. Below that, the operational overhead of forecasting, negotiating, and managing the commit usually exceeds the savings. Above $100K per month, not reserving is leaving meaningful money on the table; the burden is to forecast the workload well enough to commit safely.
When on-demand still wins
Three scenarios where on-demand is unambiguously the right answer in 2026.
Pre-deployment build phase. The team is still iterating on prompts, retrieval, and eval criteria. Inference volume is unpredictable, often swinging by 5x week-over-week. Reserving here locks the team into a commit that does not reflect the production workload. Stay on-demand until the build is past the first deployable threshold.
Inherently spiky workloads. A workload that runs at 100K actions per day on weekdays and 5K actions per day on weekends has a utilization profile that destroys reserved capacity economics. The choice is either to reserve at 5K and pay overage Mon–Fri (which often costs more than pure on-demand), or to reserve at 100K and leave that capacity about 95 percent idle every weekend (roughly 27 percent idle across the week). Both are bad. Stay on-demand or use a hybrid pattern (smaller reserve plus on-demand burst).
Multi-model uncertainty. If the team has not yet decided which frontier model is best for the workload and is actively running A/B comparisons, committing to a vendor now locks in a model choice that the eval data may overturn in 90 days. The cost of that lock-in usually exceeds the discount on the reservation. Wait until the model selection is stable.
A hybrid pattern often wins: reserve the predictable base load, run the spikes and experimentation on on-demand. We see this work well at scale. The pattern requires careful capacity forecasting, which we discuss below.
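The hybrid pattern can be sketched as a search over candidate reservation levels. The sketch assumes reserved capacity is paid every day at the discounted rate whether used or not, and that demand above the reservation bills at list price; both are simplifying assumptions, and the function names are hypothetical.

```python
def weekly_cost(daily_demand, reserved_per_day, discount, list_price=1.0):
    """Cost of one week with a daily reservation plus on-demand burst.

    Assumptions: the reservation is billed for every day at the discounted
    rate, and any demand above it bills at plain list price.
    """
    reserved_bill = reserved_per_day * len(daily_demand) * list_price * (1 - discount)
    burst_units = sum(max(0, d - reserved_per_day) for d in daily_demand)
    return reserved_bill + burst_units * list_price


def best_reserve_level(daily_demand, candidates, discount, list_price=1.0):
    """Return the candidate reservation level with the lowest weekly cost."""
    return min(candidates, key=lambda r: weekly_cost(daily_demand, r, discount, list_price))


if __name__ == "__main__":
    # Weekday/weekend profile from the spiky-workload example, in actions per day.
    demand = [100_000] * 5 + [5_000] * 2
    levels = [0, 5_000, 50_000, 100_000]
    for level in levels:
        print(f"reserve {level:>7,}/day: weekly cost {weekly_cost(demand, level, 0.25):,.0f}")
    print("cheapest level:", best_reserve_level(demand, levels, 0.25))
```

With a 25 percent discount and list-price overage, the cheapest option in this sketch is to reserve only the 5K base and burst on-demand, while a full 100K reservation costs more than pure on-demand. If overage bills above list price, the ranking can change, so the overage rate belongs in the model as an explicit input.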
The four failure modes that erase the savings
Buyers who get reservation wrong tend to fail in one of four ways. Each is preventable.
Failure 1: Over-committing in month one. The team forecasts based on aspirational launch volumes and commits to capacity that takes nine months to grow into. The underuse charge across those nine months erases the year-one savings entirely. Mitigation: ramp the commit. Most vendors support quarterly step-ups. Start at 60 percent of forecast and ramp.
Failure 2: Forgetting overage pricing. The reserved tier covers up to capacity X, and overage above X bills at a rate that may be higher than on-demand. A workload that occasionally bursts to 1.4x of reserved capacity can spend more in overage than the reservation saved. Mitigation: model the overage scenarios explicitly. If the workload bursts above 1.2x of expected average in more than 10 percent of months, lower the reservation and let the overage hit on-demand.
Failure 3: Locking into a model just before a price drop. Frontier models drop in price every 4 to 6 months. A 12-month reservation at month-zero pricing can be 20 percent more expensive than on-demand by month 8. Mitigation: prefer 6-month commits or commits with re-pricing clauses; for 12+ month commits, build in MFN (most-favored-nation) language.
Failure 4: Reservation that does not survive the model upgrade. A reservation tied to Claude 4.7 may not transfer to Claude 4.8 cleanly, and the team finds itself either paying for unused 4.7 capacity or not getting reservation pricing on 4.8. Mitigation: prefer reservations that carry across model versions in the same family, and confirm transfer rules in the contract.
The common thread: a reservation is a forecast, and bad forecasts produce bad reservations. The eval-engineering practice that anchors the manifesto also produces the workload telemetry that anchors a defensible reservation. Without trustworthy production telemetry, reservation strategy becomes a gamble, a structural risk we explore in the AI project budget anti-patterns piece.
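The first failure mode and its mitigation can be compared numerically. The sketch below uses made-up figures (in thousands of dollars): usage that takes nine months to reach a 100 steady state, a 30 percent discount, and overage at list price. The ramp schedule and function names are hypothetical.

```python
def monthly_bill(usage_at_list, commit_at_list, discount):
    """Bill for one month: the commit is paid in full at the discounted rate,
    and usage above the commit bills at list price (an assumption)."""
    return commit_at_list * (1 - discount) + max(0, usage_at_list - commit_at_list)


def annual_cost(usage_by_month, commit_by_month, discount):
    """Sum of monthly bills for matching usage and commit schedules."""
    if len(usage_by_month) != len(commit_by_month):
        raise ValueError("schedules must have the same length")
    return sum(monthly_bill(u, c, discount) for u, c in zip(usage_by_month, commit_by_month))


if __name__ == "__main__":
    # Hypothetical list-price usage in $K: grows for nine months, then holds.
    usage = [20, 30, 40, 50, 60, 70, 80, 90, 100, 100, 100, 100]
    flat_commit = [100] * 12                      # full forecast from month one
    ramped_commit = [60] * 3 + [80] * 3 + [100] * 6  # quarterly step-ups from 60 percent

    print("on-demand:", sum(usage))
    print("flat commit:", round(annual_cost(usage, flat_commit, 0.30), 1))
    print("ramped commit:", round(annual_cost(usage, ramped_commit, 0.30), 1))
```

With these illustrative numbers the flat commit costs about the same as pure on-demand (roughly $840K for the year), so the discount is fully consumed by underuse, while the ramped schedule lands near $714K.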
How to write reserved capacity into the project plan
Three operational practices turn the theory into a procurement gate.
Wait until month 3 of production. The inference forecast that supports a reservation is the one built from three months of post-launch telemetry, not the one estimated pre-launch. Reservations signed before production launch are, on average, 25 percent off the actual workload shape. Wait.
Build the model with explicit underuse and overage scenarios. The economic model behind any reservation should plot the realized savings against utilization at 50, 60, 70, 80, 90 percent and against burst behavior at +10, +20, +30 percent overage. If the realized savings are positive only in the central scenario, the reservation is too aggressive.
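A scenario grid of this kind can be sketched in a few lines. Load is expressed as a fraction of reserved capacity, so 0.5 to 0.9 are the underuse scenarios and 1.1 to 1.3 are the burst scenarios. The overage multiplier (how much more than list price overage costs) is an assumed input, and all names are hypothetical.

```python
UNDERUSE_LOADS = [0.5, 0.6, 0.7, 0.8, 0.9]
BURST_LOADS = [1.1, 1.2, 1.3]


def realized_savings(load, discount, overage_multiplier=1.0):
    """Savings versus on-demand, as a fraction of the reservation's list value.

    load:               demand as a fraction of reserved capacity.
    discount:           fractional discount on the reserved tier.
    overage_multiplier: overage price relative to list (1.0 = list price).
    """
    on_demand = load
    # The reservation is paid in full; demand above capacity bills as overage.
    reserved = (1 - discount) + max(0.0, load - 1.0) * overage_multiplier
    return on_demand - reserved


def scenario_table(discount, overage_multiplier=1.0):
    """Map each underuse and burst scenario to its realized savings."""
    return {
        load: realized_savings(load, discount, overage_multiplier)
        for load in UNDERUSE_LOADS + BURST_LOADS
    }


def losing_scenarios(discount, overage_multiplier=1.0):
    """Scenarios in which the reservation costs more than on-demand."""
    table = scenario_table(discount, overage_multiplier)
    return [load for load, saved in table.items() if saved < 0]


if __name__ == "__main__":
    for load, saved in scenario_table(0.30, overage_multiplier=1.5).items():
        print(f"load {load:.0%}: savings {saved:+.1%} of reserved list value")
    print("losing scenarios:", losing_scenarios(0.30, overage_multiplier=1.5))
```

If most of the grid comes back negative, that is the signal the paragraph above describes: the reservation only works in the central forecast and should be sized down.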
Tie the reservation to the eval-threshold milestone. A reservation signed before the system clears its eval threshold is a reservation on a workload that may not exist as currently sized. The right gate is: reach eval threshold and 90 days of stable production traffic, then reserve. We discuss the eval-threshold gate in the AI project pricing models piece.
Frequently asked questions
What discount should I expect on reserved AI capacity in 2026?
For 12-month commits at meaningful volume ($40K+ per month) the typical range is 25 to 40 percent off on-demand list, with 50 percent achievable on 36-month commits at high volume. Frontier vendors are reasonably consistent across this range; the pattern varies more than the headline rate.
Below what monthly spend is reservation not worth the operational overhead?
Roughly $40K per month of frontier-model inference spend. Below that, the time to forecast, negotiate, manage, and reconcile the reservation usually consumes more than the savings. Volume tier discounts (the gentlest pattern, with the smallest savings) can still apply at lower spend without the management burden.
How does reservation interact with model routing?
Carefully. A pure-router strategy that dynamically picks the best model for each request will under-utilize any single-model reservation. The clean pattern is multi-model marketplace commits at the cloud vendor layer. We explore the routing economics in the AI project model-routing economics piece.
What about open-source models on dedicated GPUs?
Open-source serving on rented GPU capacity (H100s, B200s) is a separate category that resembles reserved compute more than reserved inference. The economics are credible only at high volumes (millions of actions per day) and require a serving infrastructure investment that most projects under-budget. The buyer-side question is the same: forecast utilization, model the breakeven, hold a 60 to 70 percent utilization bar.
Should the AI agency or the buyer hold the reservation?
The buyer holds it. Inference is pass-through; the buyer’s procurement function carries the negotiated relationship and the financial commitment. Agencies that hold reservations on behalf of buyers have a structural conflict; they make money on the spread. We discuss the cleanly-aligned model in the AI agency manifesto.
How long should the reservation term be?
For workloads that have stabilized post-launch, 12 months is the typical sweet spot. Six-month terms reduce model-upgrade risk but capture less of the volume discount. 24 to 36-month terms only make sense at high volume and with strong MFN clauses against frontier price drops.
What is the failure mode that causes reservation regret most often?
Over-committing on month-one volume forecasts. Half the buyers we work with who have negative reservation experiences over-committed in month one and spent the next three quarters trying to grow into the commit. Ramp the reservation. Start at 60 percent of forecast.
How does this connect to the AI project economics manifesto?
The manifesto names inference as a pass-through line, not agency margin. Reservation strategy is the buyer’s discipline that turns “pass-through” from a passive concession into an active cost lever. Without reservation strategy, the manifesto’s principle 6 leaves money on the table.
Key takeaways
- Reserved AI capacity is worth the management overhead above roughly $40K per month of frontier-model inference spend. Below that, on-demand or volume-tier pricing is usually right.
- The four contract patterns are committed-use, provisioned throughput, volume-tier, and multi-model marketplace commits. The right pattern depends on workload predictability, latency, and routing strategy.
- Breakeven utilization sits around 60 to 70 percent for typical 25 to 40 percent discount tiers. A workload that does not sustain that utilization should stay on-demand or hybrid.
- The four failure modes that erase savings are over-committing, ignoring overage pricing, locking in just before a price drop, and reservations that do not survive model upgrades.
- Sign reservations after month 3 of production telemetry, model underuse and overage scenarios explicitly, and tie the reservation to the eval-threshold milestone.
Arthur Wandzel