The hourly rate is not the problem. The pricing model is. Most 2026 AI agency invoices still arrive as a blended T&M line with passthrough inference at an unstated markup, milestones tied to deliverables nobody can independently verify, and a “discovery phase” that bills like a project but commits to nothing. When inference spend triples post-launch, there is no cap to enforce. When the relationship ends, the buyer has paid for capacity they cannot reclaim.
This piece is the pricing surface of the AI agency manifesto. The manifesto argues for a new operating model; this is the commercial structure that makes it legible to procurement, finance, and a CFO who does not want a phone call about overruns. Below: six pricing principles, each with a rationale, the failure mode of the status quo, contract language for the redline, and the edge cases that test whether the principle holds.
These are clauses we are willing to sign, and clauses any AI agency that wants to be defensible against legal, finance, and a sober CTO should be willing to sign too.
Why pricing has to change in 2026
The 2018 web-development pricing playbook (T&M with optional fixed-bid milestones, passthrough hosting at cost-plus, weekly burn reports) was built for deterministic software where developer time was the dominant variable cost. AI engagements break those assumptions on three dimensions:
- Inference is the largest variable cost in many production AI features, and it scales with usage, not with build effort. A model that treats inference as “hosting” misprices the engagement by an order of magnitude.
- The build is partly the eval suite, not just the code. Without milestone-tied eval acceptance criteria, “milestone complete” reduces to the agency’s word, which makes it a pricing problem, not just a delivery one.
- Frontier models ship every six to twelve weeks. Pricing that bills the buyer to retest each upgrade incentivizes the agency to skip the test or to bill for work the buyer would not consciously approve.
The result is a category-wide trust deficit. Buyers approve a six-figure SOW and discover, three months in, that they cannot tell which line items reflect work done for them versus work the agency was going to do anyway. The fix is not better invoicing; it is a pricing structure where the line items map to something the buyer can independently verify. The six principles below define that structure.
Principle 1: Pass-through inference billing on buyer keys
The principle. Whenever feasible, production inference runs on the buyer’s own provider account. The agency builds against buyer-owned API keys; the provider invoices the buyer directly. When that is not possible (early prototypes, shared dev environments, sub-processor reasons), inference is billed at cost with a stated, capped markup, supported by the underlying provider invoice.
Rationale. Inference is not the agency’s value-add. Bundling it into a blended rate hides the fastest-growing cost line, gives the agency a financial incentive against efficiency work (caching, prompt compression, smaller-model routing), and converts a passthrough into a margin product.
What the status quo gets wrong. Two patterns dominate. The silent markup: inference billed at “cost” with a 30–60% effective margin once provider discounts and reseller credits are accounted for. And the bundle: inference rolled into a monthly retainer with no token-level visibility, so the buyer cannot tell whether their feature costs $4K/month or $40K/month until the retainer ends.
Contract language.
Inference Provisioning and Billing. Wherever technically feasible, Agency shall develop, test, and deploy against API credentials owned by Client. Where inference is procured under Agency credentials for reasons documented in the SOW, Agency shall pass through provider charges at cost plus a markup not to exceed [X]%, supported by the underlying redacted provider invoice. A monthly Inference Spend Cap shall be set in the SOW; charges in excess of the Cap require written Client approval before incurrence. Per-route and per-model token usage shall be reported monthly.
Edge cases. Shared dev environments where multiple clients exercise the same agency-owned key: handle by per-client tagging, never blended invoicing. Reseller arrangements where the agency receives provider credits: when running on agency credentials, those credits flow to the buyer, not to agency margin. Fine-tuning runs: a one-shot training cost, not ongoing inference, billed at cost as a line item and not under the inference cap.
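The billing arithmetic in the contract language above can be sketched in a few lines. This is a minimal illustration, not a billing system: the function name `bill_inference`, its parameters, and the choice to compare the cap against the marked-up total (rather than raw provider charges) are assumptions that a real SOW would need to settle.

```python
from dataclasses import dataclass


@dataclass
class InferenceInvoice:
    provider_cost: float   # charges shown on the provider invoice, in dollars
    markup: float          # agency markup, in dollars
    total: float           # provider_cost + markup
    needs_approval: bool   # True when the total exceeds the monthly cap


def bill_inference(provider_cost: float, markup_pct: float,
                   max_markup_pct: float, monthly_cap: float) -> InferenceInvoice:
    """Pass provider charges through at cost plus a capped markup.

    Hypothetical helper: max_markup_pct stands in for the [X]% ceiling
    in the clause, and monthly_cap for the Inference Spend Cap.
    """
    if markup_pct > max_markup_pct:
        raise ValueError("markup exceeds the ceiling stated in the SOW")
    markup = round(provider_cost * markup_pct / 100, 2)
    total = round(provider_cost + markup, 2)
    # Assumption: the cap is checked against the total billed to the client.
    return InferenceInvoice(provider_cost, markup, total, total > monthly_cap)


invoice = bill_inference(provider_cost=4000.0, markup_pct=10.0,
                         max_markup_pct=15.0, monthly_cap=5000.0)
print(invoice)
```

In practice the approval check would run before the spend is incurred, not at invoice time; the sketch only shows which numbers the clause ties together.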
Principle 2: Eval-milestone billing for production work
The principle. For production engagements past discovery, milestone payments are conditioned on eval acceptance criteria defined before work begins. The eval suite (datasets, rubrics, judge prompts, thresholds) is delivered at each milestone. Payment releases when the buyer can reproduce the eval scores against the milestone checkpoint.
Rationale. Without a milestone-tied eval, “the AI feature works” is unfalsifiable. The agency has the operational knowledge to assert it, the buyer does not, and the milestone payment becomes a trust transaction. Eval-milestone billing turns it into arithmetic: the score either reproduces or it does not. It also prevents the most expensive AI failure mode: paying for a system that scored well in a demo and degrades silently against the production distribution.
What the status quo gets wrong. Most milestone schedules in 2026 still read like web-development schedules: “Wireframes accepted,” “Backend complete,” “UAT signoff.” None of those test the AI behavior. By the time evals are bolted on at launch, the agency has already been paid for milestones whose acceptance criteria were never the AI’s behavior.
Contract language.
Eval-Conditioned Milestones. Each Milestone in this SOW is associated with an Acceptance Eval defined in the Milestone Schedule, comprising (a) named datasets, (b) scoring rubrics with weights, (c) judge-model prompts and version identifiers, (d) numerical thresholds for pass, and (e) an executable script. Agency shall deliver the Acceptance Eval at Milestone kickoff and again at delivery. Milestone payment releases upon Client’s successful reproduction of the threshold score within rounding tolerance, against the delivered system checkpoint, in Client’s environment.
Edge cases. Subjective acceptance criteria (writing quality, brand tone): handle with judge-model rubrics and sample outputs locked in advance, not “we will know it when we see it.” Drift in the underlying judge model: pin the judge model version in the eval and treat a judge upgrade as a separate qualifying event. Disagreement on whether the threshold is met: add a tiebreaker procedure naming a third-party reviewer in the SOW.
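To make “the score either reproduces or it does not” concrete, here is a hedged sketch of the release check. The rubric names, weights, and the 0.005 rounding tolerance are illustrative assumptions; the clause leaves the actual tolerance and rubric design to the Milestone Schedule.

```python
from typing import Mapping


def weighted_score(scores: Mapping[str, float],
                   weights: Mapping[str, float]) -> float:
    """Combine per-rubric scores (0 to 1) using the rubric weights."""
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


def payment_releases(client_scores: Mapping[str, float],
                     weights: Mapping[str, float],
                     threshold: float,
                     tolerance: float = 0.005) -> bool:
    """True when the client's own run meets the threshold within tolerance.

    client_scores must come from the client's environment, run against the
    delivered checkpoint; the agency's reported numbers are not an input.
    """
    return weighted_score(client_scores, weights) >= threshold - tolerance


# Hypothetical rubric: 60% task accuracy, 40% brand tone.
weights = {"accuracy": 0.6, "tone": 0.4}
client_run = {"accuracy": 0.9, "tone": 0.8}
print(payment_releases(client_run, weights, threshold=0.85))
```

The design choice worth noting is that only the client's reproduced scores enter the check, which is what separates this from an agency-asserted milestone.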
Principle 3: Fixed-fee discovery weeks
The principle. Pre-engagement discovery is a fixed-fee, fixed-duration sprint with a defined deliverable: a written assessment, an architecture sketch, a draft eval suite, and a recommendation to proceed or not. The buyer pays regardless of whether they engage further. If the agency recommends not proceeding, that conclusion is a delivered artifact, not a refundable failure.
Rationale. Discovery is the most leveraged work in an AI engagement and the most likely to be done badly when unpaid or T&M. Unpaid discovery rewards velocity over rigor; the agency ships an optimistic proposal because that is what closes. T&M discovery rewards padding. Fixed-fee discovery aligns the agency to the only outcome the buyer wants from week one: an honest verdict on whether the project is real.
What the status quo gets wrong. “Free discovery, paid build” functions as a CAC the agency recovers in build margin, distorting the build estimate upward. “T&M discovery to scoping milestone” produces a 60-page Notion doc padded to justify the hours, often arriving past the buyer’s planning window.
Contract language.
Discovery Sprint. Agency shall conduct a Discovery Sprint of [N] weeks at a Fixed Fee of $[X], payable on commencement. Deliverables include: (a) a written Technical Assessment covering data readiness, model fit, and infrastructure constraints; (b) a draft Architecture and Eval Suite; (c) a Build Proposal with Milestone Schedule and SOW or a written Recommendation Not to Proceed with rationale. Discovery deliverables are Work Product owned by Client upon delivery. Engagement of Agency for the Build phase is at Client’s sole option.
Edge cases. Buyer wants discovery to roll into build seamlessly: fine, but the discovery deliverable must still be self-contained and reusable if build is awarded to a different vendor. The agency’s recommendation is “do not proceed”: a successful discovery outcome, paid in full. Discovery surfaces a scope different from what the buyer expected: the discovery contract concludes, and build is a separate negotiation, not an automatic continuation.
Principle 4: No-charge model upgrade testing
The principle. When a frontier model the system depends on ships a new version, the agency runs the existing eval suite against the new version at no charge. The buyer receives the comparison report and decides whether to upgrade. Implementation work to adopt the new model is billable; the test that informs the decision is not.
Rationale. Frontier model upgrades are the highest-leverage opportunity an AI engagement has, and the biggest risk. The buyer should not have to choose between paying for a test they may not adopt or skipping the test and missing a 30% latency or cost improvement. The eval suite from Principle 2 becomes a compounding asset across the model release cycle. Charging for the test creates the wrong incentive: the agency only runs it when the buyer is in a generous quarter, which is exactly when they should not be making upgrade decisions.
What the status quo gets wrong. Either the agency forgets to test, or it proposes a paid spike to evaluate the upgrade. Both fail. The first means the production system silently misses an Anthropic or OpenAI release that would have cut cost or latency. The second turns model upgrades into procurement events, which they are not; they are routine engineering hygiene that the eval suite makes cheap.
Contract language.
Model Upgrade Evaluation. For the duration of any Active Maintenance Period under this Agreement, when a model provider listed in the SOW publishes a new model version that is a candidate replacement for any model in production, Agency shall, within ten (10) business days, execute the Acceptance Evals against the new version and deliver a written Comparison Report covering performance, latency, and cost deltas. Such evaluation is included in the Maintenance Fee. Implementation of the upgrade, if elected by Client, is billable as a Change Order.
Edge cases. Hot-swap upgrades the agency wants to deploy without buyer review: disallow; the buyer signs off even on dominant upgrades. Provider-forced model deprecations: the test is still no-charge; migration is billable, with a stated rate cap so deprecation does not become a profit event. Buyer fatigue: convert to a quarterly cadence in the SOW, never silent skipping.
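The Comparison Report in the clause reduces to three deltas between the production model and the candidate. The sketch below assumes one aggregate eval score, a median latency, and a cost per thousand requests; those field names and the sample figures are hypothetical, chosen only to show the shape of the report.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class EvalRun:
    score: float                  # Acceptance Eval score, 0 to 1
    p50_latency_ms: float         # median latency over the eval set
    cost_per_1k_requests: float   # inference cost in dollars


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent (negative is a reduction)."""
    return round((new - old) / old * 100, 1)


def comparison_report(current: EvalRun, candidate: EvalRun) -> Dict[str, float]:
    """Performance, latency, and cost deltas of the candidate vs. production."""
    return {
        "score_delta": round(candidate.score - current.score, 4),
        "latency_change_pct": pct_change(current.p50_latency_ms,
                                         candidate.p50_latency_ms),
        "cost_change_pct": pct_change(current.cost_per_1k_requests,
                                      candidate.cost_per_1k_requests),
    }


# Illustrative numbers only, not measurements of any real model.
production = EvalRun(score=0.86, p50_latency_ms=1000.0, cost_per_1k_requests=20.0)
candidate = EvalRun(score=0.88, p50_latency_ms=700.0, cost_per_1k_requests=15.0)
print(comparison_report(production, candidate))
```

Because the inputs are just two runs of the existing eval suite, the marginal cost of producing this report at each release is small, which is the argument for including it in the Maintenance Fee.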
Principle 5: Capacity reservation with monthly kill clause
The principle. Retainer engagements reserve a stated FTE-equivalent capacity at a stated rate. The buyer can terminate with thirty days’ notice at the end of any month: no termination fee, no cure period, no “unspent hours” rollover dispute. The retainer is a service, not a subscription with an exit penalty.
Rationale. Retainer pricing is the right model for ongoing maintenance, model upgrade work, and incremental feature delivery. But two pathologies have become standard: the long-tail termination fee that makes the retainer functionally a 12-month commitment, and the unspent-hours dispute where the agency claims the buyer “owes” hours billed but not consumed. A capacity reservation is what the buyer pays for; if they no longer need it, they should be able to release it.
What the status quo gets wrong. The 2018 staff-aug retainer, with its friction-by-design exit, was tolerable when developer scarcity was extreme. In 2026, with frontier-model work demanding sharper iteration cycles and buyer requirements changing faster than annual planning, lock-in retainers misprice the underlying optionality. Monthly kill is also a quality-of-service mechanism: agencies that cannot retain a retainer monthly are mispricing or under-delivering.
Contract language.
Capacity Reservation and Termination. The Retainer reserves [N] FTE-equivalents of [defined seniority mix] capacity per calendar month at a Monthly Fee of $[X]. Either party may terminate this Retainer for convenience with thirty (30) days’ written notice effective at month end, with no termination fee, cure period, or trailing obligation other than payment of fees through the effective date and delivery of any in-flight Work Product. Unspent Capacity within a month does not roll forward and shall not be the subject of any chargeback.
Edge cases. Mid-engagement termination during a critical milestone: the SOW should specify which in-flight artifacts must be delivered as part of termination. The agency has hired specifically for the engagement: that is the agency’s bench risk, priced into the rate, not a buyer obligation. Repeated start-stop cycles to game capacity: handle with a one-time Reactivation Fee in the SOW, sized to cover real ramp cost, not as an exit penalty.
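“Thirty days’ written notice effective at month end” leaves room for interpretation, so it helps to pin the date arithmetic down. The sketch below assumes one reading: termination takes effect on the last day of the month in which the notice period expires. That reading and the function name are assumptions; the SOW should state whichever rule the parties actually intend.

```python
import calendar
from datetime import date, timedelta


def termination_effective_date(notice_date: date, notice_days: int = 30) -> date:
    """Last day of the month in which the notice period runs out.

    Assumed reading of the clause: fees are owed through the first
    month end falling on or after notice_date + notice_days.
    """
    earliest = notice_date + timedelta(days=notice_days)
    last_day = calendar.monthrange(earliest.year, earliest.month)[1]
    return date(earliest.year, earliest.month, last_day)


# Notice served on 10 March 2026: the 30 days expire on 9 April,
# so under this reading the retainer ends on 30 April.
print(termination_effective_date(date(2026, 3, 10)))
```

Under this reading the buyer's worst-case exposure is just under two monthly fees, which is a number both sides can plan against.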
Principle 6: Transparent senior-vs-junior hour ratios
The principle. Every invoice and monthly report breaks out hours by named seniority tier (typically Principal, Senior, Mid, Junior) at the rates published in the rate card. The blended rate is shown as a derived number, not a billing primitive. The buyer can see the actual mix that delivered their work and challenge it if it drifts from what was sold.
Rationale. “Blended rate” is the largest single source of buyer-side mistrust in agency invoicing. The proposal sells a senior-heavy team at a blended $300/hr; the invoice arrives with a 70% junior mix delivering the bulk of the hours, and the buyer cannot tell because the invoice does not separate them. Transparent ratios restore the link between rate card and invoice and force the agency to make staffing trade-offs deliberately, since each one is now legible.
What the status quo gets wrong. “Team rate” billing where one rate covers everyone is convenient for the agency, opaque for the buyer, and indistinguishable from junior arbitrage. “Named team” pricing where the proposal lists individuals by name but the invoice lists hours by role allows silent substitution. Both become legible to procurement only at the post-mortem stage, when it is too late to fix them.
Contract language.
Tiered Rate and Mix Reporting. The Agency rate card by tier (Principal, Senior, Mid, Junior) is incorporated as Schedule A. Each monthly invoice shall report hours billed by tier, the resulting blended rate, and the cumulative engagement-to-date mix versus the Proposed Mix represented in the SOW. A divergence of more than ten percentage points in any tier between the Cumulative Mix and the Proposed Mix shall be flagged in the invoice cover note, and Client shall have the right to require a written staffing rationale before approving the invoice.
Edge cases. The agency wants to surge senior hours mid-engagement: flag and approve, do not hide. A team member promoted mid-engagement: the tier change applies prospectively only, never retroactively. Junior hours under senior supervision being billed at junior rates: that is the right behavior; the alternative is misrepresenting the work.
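The two derived numbers in the reporting clause, the blended rate and the ten-percentage-point divergence flag, are simple to compute once hours are broken out by tier. The rate card and hours below are hypothetical and serve only to illustrate the calculation; the function names are likewise assumptions.

```python
from typing import Dict


def blended_rate(hours: Dict[str, float], rates: Dict[str, float]) -> float:
    """Blended hourly rate derived from tier hours and the rate card."""
    total_hours = sum(hours.values())
    return sum(hours[tier] * rates[tier] for tier in hours) / total_hours


def mix_divergences(cumulative_hours: Dict[str, float],
                    proposed_mix_pct: Dict[str, float],
                    limit_pp: float = 10.0) -> Dict[str, float]:
    """Tiers whose cumulative share differs from the Proposed Mix by more
    than limit_pp percentage points, mapped to the signed gap."""
    total = sum(cumulative_hours.values())
    flagged = {}
    for tier, proposed in proposed_mix_pct.items():
        actual = cumulative_hours.get(tier, 0.0) * 100 / total
        gap = actual - proposed
        if abs(gap) > limit_pp:
            flagged[tier] = round(gap, 1)
    return flagged


# Hypothetical Schedule A rate card and engagement-to-date hours.
rates = {"Principal": 450.0, "Senior": 350.0, "Mid": 250.0, "Junior": 150.0}
hours = {"Principal": 10.0, "Senior": 20.0, "Mid": 0.0, "Junior": 70.0}
proposed = {"Principal": 15.0, "Senior": 50.0, "Mid": 25.0, "Junior": 10.0}

print(blended_rate(hours, rates))
print(mix_divergences(hours, proposed))
```

With these sample figures the junior tier carries 70% of the hours against a proposed 10%, which is exactly the drift the cover-note flag is meant to surface before the invoice is approved.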
How to use this manifesto
As a buyer. Send the six principles to candidate agencies before the SOW arrives. Ask each to mark them agreed, agreed with redline, or declined with rationale. Three reactions tell you something:
- Agreed across the board with no redlines. Either disciplined or not reading carefully; probe Principle 5 on termination and Principle 1 on inference markup to find out which.
- Selective redlines on Principles 1 and 6. Reasonable; markup percentages and tier definitions are negotiable, the underlying transparency is not.
- Decline on Principles 2, 3, or 5. Walk-away signal; those are the structural mechanisms that turn the engagement into something procurement and finance can defend.
Pair this manifesto with the seven contractual commitments, which cover work product and IP rather than commercial structure.
As an agency. If your shop cannot make these commitments, your invoices are partly extracted information rent. The fix is operational, not commercial: meter inference at the route level, write evals from project day one, treat discovery as a real product, and refuse to hide tier mix in a blended rate. Agencies that operate this way close enterprise contracts faster; legal and procurement stop fighting over the invoice format because the invoice format is no longer the dispute. For more on the structural pressures, see why most AI agencies will not survive the next 18 months and the broader field guide to evaluating an AI agency.
Frequently asked questions
Is this pricing model only for retainers, or does it apply to fixed-bid projects?
Both. Principles 1, 2, 4, and 6 (inference, evals, upgrades, tier transparency) are independent of commercial structure. Principles 3 and 5 (discovery, capacity reservation) describe two specific commercial structures the manifesto endorses. Fixed-bid build projects use Principle 2 milestone-eval billing as the primary mechanism; retainers use Principle 5. Time-and-materials remains acceptable for narrow exploratory work but is the weakest fit for production engagements.
What is a defensible markup percentage on inference passthrough in 2026?
The defensible band is 0–15%. Zero markup applies when the buyer brings their own provider account. Ten to fifteen percent covers payment-processing, treasury overhead, and the minor accounting cost of consolidating multiple sub-clients onto one agency account. Above 20–25% the agency is monetizing inference rather than building, which is a different business and should be priced as a different product. The percentage should be stated in the SOW, not buried in a blended rate.
How does eval-milestone billing handle subjective work like writing quality?
It handles it better, not worse; subjectivity is exactly what produces unfalsifiable acceptance disputes under traditional milestone schedules. The pattern is to lock a judge-model rubric and a set of canonical samples in advance, including illustrative pass and fail examples, pin the judge model version, and treat the eval as the operational definition of “writing quality” for the engagement. If the rubric is wrong, that is a rubric negotiation, not a delivery dispute, which is a far cleaner conversation.
What if the agency wants to commit to no-charge upgrade testing only for major model releases?
Acceptable, with explicit definition. Specify in the SOW which providers are in scope and what counts as a release that triggers the test (typically: a new model name, not a point-version patch within an existing model). The principle is that the test cadence is not negotiated at each release, because that is exactly when negotiation produces bad outcomes. Buyers should resist any version that lets the agency choose which releases to test on a case-by-case basis.
Does a monthly kill clause make retainer pricing impossible to plan against?
The opposite, in our experience. A retainer the buyer can leave at any time is one the agency has to keep earning, which forces honest capacity sizing rather than overcommitment. Agencies that insist on twelve-month minimums are typically doing so because they have already over-hired against optimistic assumptions; the buyer should not absorb that bench risk. The right resolution is for the agency to right-size capacity and price the rate to reflect the optionality the buyer is buying.
How does this manifesto handle equity, success fees, or outcome-based pricing?
It does not, by design. Outcome-based pricing in AI engagements is structurally hard because the outcome (revenue lift, ticket deflection, conversion improvement) is rarely attributable to a single system and is often unmeasurable inside a single quarter. Agencies proposing outcome-based pricing are usually either well-aligned partners willing to underwrite product risk, sophisticated parties extracting upside on work that would have been delivered anyway, or naive parties who will renegotiate when the metric stalls. None is a default model; negotiate them as bespoke arrangements on top of a manifesto-compliant base.
How do these principles interact with the seven commitments framework?
They are complementary surfaces of the same operating model. The seven commitments cover work product and IP: eval suites, prompt registries, model weights, sunset packages. The pricing manifesto covers commercial structure: how money flows for inference, milestones, discovery, upgrades, capacity, and labor mix. A defensible 2026 AI engagement signs both: commitments for what the buyer owns, manifesto for what the buyer pays. Either alone leaves a gap large enough to drive an invoice dispute through.
What is the simplest test to run on a candidate agency’s pricing model?
Ask three questions and time the answers. (1) “What was your inference markup on your last three engagements, and where in the SOW is it stated?” A defensible agency answers in under a minute. (2) “Show me a milestone payment that was withheld because the eval did not reproduce.” A defensible agency has at least one example and is comfortable describing it. (3) “What is your retainer termination policy in writing?” A defensible agency reads it back from a standard clause. If any of those takes a meeting, the operating model is not yet ready for an enterprise buyer.
Key takeaways
- Inference is the largest variable cost in 2026 AI engagements; pricing models that bundle it into a blended rate misprice the engagement and disincentivize the agency from making it cheaper.
- Eval-milestone billing converts AI delivery from a trust transaction into an arithmetic transaction, eliminating the most expensive failure mode: paying for a system that scored well in a demo and degrades against the production distribution.
- Fixed-fee discovery aligns the agency to an honest verdict in week one rather than to whatever proposal closes the build deal; “do not proceed” is a successful discovery outcome and should be paid as such.
- No-charge model upgrade testing turns the eval suite into a compounding asset across the frontier-model release cycle, removing the perverse incentive to skip upgrades the buyer would benefit from.
- Capacity-reservation retainers with a monthly kill clause replace the 2018 staff-aug lock-in model with a structure that prices optionality correctly and forces continuous quality of service.
- Transparent senior-vs-junior hour ratios eliminate the largest single source of buyer-side mistrust: the gap between the team that was sold and the mix that was billed.
- Sign this manifesto alongside the seven contractual commitments; together they define a defensible 2026 AI engagement.
Related reading
- The AI Agency Manifesto: What an AI Dev Partner Should Be in 2026; the pillar this pricing model anchors to.
- The 7 Commitments Every AI Dev Agency Should Make in Writing; the work-product and IP companion to this commercial framework.
- The Decline of the Staff-Aug AI Agency; why the 12-month-retainer model is mispriced for 2026.
- The AI Agency Tax: Why Most Engagements Waste 30% on Coordination; the operational counterpart to commercial structure.
- A Field Guide to Evaluating an AI Agency in Under 90 Minutes; pre-engagement diligence that pairs with this pricing conversation.
Arthur Wandzel