Feature-list scopes are how AI projects fail in slow motion. A feature is binary: either you implemented “AI summarization” or you did not. AI quality is continuous: a summarizer can pass at 60 percent faithfulness, ship under a green status, and quietly destroy user trust six weeks later. The mismatch between a binary scoping vocabulary and a continuous quality reality is the single most expensive defect in 2026 AI procurement, and almost every statement of work I read still contains it. The fix is not better feature lists. The fix is to stop writing feature lists and start writing eval suites.
This article makes the case that the unit of scope for an AI engagement should be a passing eval at a defined threshold, not a checkbox next to a capability name. It draws on the broader argument made in the AI agency manifesto, and it pairs with the operational mechanics described in the AI model evaluation and testing services guide and the AI testing and QA process guide. If you are writing a procurement document this quarter, this is the part of the playbook to internalize before you put pen to paper.
Decision Scope
This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.
Why feature-scoping fails for AI work
Feature-scoping is a vocabulary borrowed from deterministic software, where it works well. “Implement OAuth login” has a clean acceptance test: the user logs in, or they do not. “Implement a quarterly billing report” either renders the right numbers or does not. Functional requirements are observable, repeatable, and binary. The contract closes when the box is ticked.
AI features do not behave like that. “Implement AI summarization” can be checked off the moment the model returns a non-empty string. The same feature can score 0.42 on a faithfulness eval against the source document, hallucinate a citation in 12 percent of outputs, and produce summaries longer than the original document on inputs above 40K tokens. None of those defects are visible in a feature-acceptance vocabulary. The agency ships, the invoice clears, and the regression is somebody else’s problem in week eight.
The vocabulary mismatch shows up in four predictable ways:
1. Binary acceptance hides a continuous quality distribution. Feature-completeness implies a step function. AI output is a distribution. When you scope by feature, you accept whatever point on the distribution the agency happens to ship.
2. Feature lists ignore the failure modes that break trust. “AI Q&A” is one feature. “AI Q&A that refuses when the source documents do not contain the answer” is a different feature. “AI Q&A that cites the source paragraph for every claim” is a third. A feature-list scope rolls all three into one bullet, and only one of them is a useful product.
3. Procurement cannot tell two finished systems apart. Two agencies can both deliver “AI summarization” and ship systems that differ by 30 points on faithfulness. The buyer learns about it from a customer complaint at the worst possible moment.
4. Change control breaks. When the underlying model is silently updated by the provider, a feature-list scope has no mechanism to detect or enforce the resulting quality drift. The feature is still “implemented.” The product is materially worse.
These are the dominant outcomes of the feature-scoped engagements I have audited in the last 18 months, consistent enough that I treat a feature-list SOW as a leading indicator of project failure.
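The fourth failure mode, silent drift, is the easiest to make concrete. What follows is a minimal Python sketch, assuming the buyer stored a score per eval when the release was accepted and reruns the same suite later. The eval names, the scores, and the 0.02 tolerance are illustrative assumptions, not figures from a real engagement.

```python
def detect_drift(baseline, rerun, tolerance=0.02):
    """Return evals whose score dropped by more than `tolerance` since baseline.

    Assumes every eval passed in is scored so that higher is better.
    """
    drifted = {}
    for name, old_score in baseline.items():
        drop = old_score - rerun[name]
        if drop > tolerance:
            drifted[name] = round(drop, 4)
    return drifted


# Hypothetical scores: the accepted release vs. a rerun after a provider update.
baseline = {"faithfulness": 0.93, "refusal": 0.96, "domain_accuracy": 0.87}
rerun = {"faithfulness": 0.88, "refusal": 0.95, "domain_accuracy": 0.87}

drifted = detect_drift(baseline, rerun)
print(drifted)
```

A feature-list scope has nothing to feed into a check like this, because it never recorded a baseline in the first place.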
The core mismatch: binary scope, continuous quality
The cleanest way to see the problem is to write the same scope two ways and compare them.
The feature-list scope:
- Implement AI summarization for uploaded documents
- Implement AI Q&A against the document corpus
- Implement AI search across the knowledge base
This is the scope I see in roughly 70 percent of inbound RFPs. Three bullets. Three checkboxes. No way to tell whether the delivered system is good or terrible without running it in production and waiting for users to tell you.
The eval-list scope:
- Pass the internal domain knowledge test suite (240 questions across 12 topic areas) at 0.85 accuracy, with 0.90 accuracy on the safety-critical subset
- Pass the cited-source retrieval eval at recall@5 of 0.90 and precision@5 of 0.75 against the gold-standard corpus
- Pass the faithfulness eval at 0.92 (claims supported by retrieved sources, judged by a separate rubric-based evaluator)
- Pass the latency eval at P95 under 800ms for retrieval and P95 under 4 seconds for generation, measured against the production traffic mix
- Pass the cost eval at a per-query unit economics bound of $0.018, measured over the eval suite traffic profile
- Pass the refusal eval at 0.95 on out-of-corpus questions (the system declines to answer rather than hallucinate)
This is six bullets instead of three, and every one of them is falsifiable. You can run the eval suite on the delivered system and produce a single pass/fail verdict that says “shipped” or “did not ship.” You can rerun the same suite three months later when the provider silently updates the model and detect drift before it reaches users. The feature names (summarization, Q&A, search) do not appear in the contract, because they are not the thing the contract is buying. The contract is buying a system that passes the evals.
The eval-list scope is harder to write. It requires the buyer to articulate what good looks like, in numbers, before the work begins. That is exactly the work feature-list scopes were invented to avoid. Avoiding it is what produces the failure mode in the first place.
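One way to see how the eval-list scope collapses into a ship or no-ship verdict is to write the thresholds down as data. The sketch below uses the numbers from the list above. The key names, the `gate` function, and the sample results are hypothetical, and how to treat a value exactly on the boundary ("under 800ms" versus "at most 800ms") is an assumption the contract itself should pin down.

```python
# Each entry: eval name -> (limit, direction).
# "min": the score must be at least the limit.
# "max": the measurement must not exceed the limit (boundary handling assumed).
THRESHOLDS = {
    "domain_accuracy": (0.85, "min"),
    "domain_accuracy_safety_critical": (0.90, "min"),
    "retrieval_recall_at_5": (0.90, "min"),
    "retrieval_precision_at_5": (0.75, "min"),
    "faithfulness": (0.92, "min"),
    "retrieval_latency_p95_ms": (800, "max"),
    "generation_latency_p95_s": (4.0, "max"),
    "cost_per_query_usd": (0.018, "max"),
    "refusal_out_of_corpus": (0.95, "min"),
}


def gate(results, thresholds=THRESHOLDS):
    """Return (shipped, failures) for one run of the suite."""
    failures = []
    for name, (limit, direction) in thresholds.items():
        if name not in results:
            # A missing measurement counts as a failure, not a pass.
            failures.append(f"{name}: missing result")
            continue
        value = results[name]
        ok = value >= limit if direction == "min" else value <= limit
        if not ok:
            failures.append(f"{name}: got {value}, required {direction} {limit}")
    return (not failures, failures)


# Hypothetical run: everything passes except faithfulness.
results = {
    "domain_accuracy": 0.87,
    "domain_accuracy_safety_critical": 0.91,
    "retrieval_recall_at_5": 0.93,
    "retrieval_precision_at_5": 0.78,
    "faithfulness": 0.89,
    "retrieval_latency_p95_ms": 640,
    "generation_latency_p95_s": 3.2,
    "cost_per_query_usd": 0.015,
    "refusal_out_of_corpus": 0.96,
}

shipped, failures = gate(results)
print("shipped" if shipped else "did not ship", failures)
```

A single failed threshold is enough to block the verdict. That makes the scope falsifiable in a way three checkboxes are not.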
What an eval-scoped statement of work contains
A working eval-scoped SOW has six parts. The discipline is in writing them down and signing them as the contract.
1. The eval suite, by name. The contract names specific evals (domain accuracy, faithfulness, retrieval recall, latency, cost, refusal, safety, format adherence) and their versions. The suite lives in the buyer’s repository, not the agency’s. New evals are added by amendment, not by Slack.
2. The threshold for each eval. Every eval has a number, set by the buyer’s domain expert with a written rationale tying it to a business outcome. “0.85 because most legal-clause classifiers in production sit between 0.80 and 0.92” is defensible. “0.90 because round numbers feel right” is not.
3. The traffic profile the eval suite represents. Evals are sampled from somewhere. The contract specifies the sampling source (production logs from a comparable system, a synthesized distribution labeled by a domain expert, a public benchmark) and what it claims to represent. A passing eval against the wrong traffic profile is worse than no eval at all.
4. The pass/fail mechanism. CI runs the suite on every PR. The agency cannot self-grade in a notebook the night before the demo. The buyer can rerun the suite on demand.
5. Change-control for the suite itself. Evals are versioned. The contract specifies how new evals are added (agency proposes, buyer approves, threshold set jointly), how thresholds are raised (quarterly review), and what happens if the suite is materially changed mid-engagement.
6. The remedy when the suite fails. Eval-scoped contracts say: the milestone is not paid, and the engagement extends until the suite passes. Feature-scoped contracts have no answer to this question, which is precisely why the failure mode persists.
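Items 1, 2, and 5 can be made checkable by treating each suite version as data and diffing versions before an amendment is signed. This is a rough sketch under stated assumptions: `EvalSpec`, `diff_suites`, and the example entries are hypothetical names and values chosen for illustration, not a standard format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSpec:
    name: str
    threshold: float
    rationale: str  # written justification tying the number to a business outcome


def diff_suites(old, new):
    """Compare two suite versions, each a dict of eval name -> EvalSpec."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "threshold_changed": sorted(
            n for n in shared if old[n].threshold != new[n].threshold
        ),
        # Every eval in the new version needs a non-empty written rationale.
        "missing_rationale": sorted(
            n for n, spec in new.items() if not spec.rationale.strip()
        ),
    }


# Hypothetical suite versions.
v1 = {
    "faithfulness": EvalSpec("faithfulness", 0.92, "Claims must be source-backed."),
    "refusal": EvalSpec("refusal", 0.95, "Declining beats hallucinating."),
}
v2 = {
    "faithfulness": EvalSpec("faithfulness", 0.94, "Raised at quarterly review."),
    "refusal": EvalSpec("refusal", 0.95, "Declining beats hallucinating."),
    "format_adherence": EvalSpec("format_adherence", 0.98, ""),
}

changes = diff_suites(v1, v2)
print(changes)
```

In this example the amendment would be sent back: it adds an eval with no rationale attached, which is exactly the kind of change that otherwise gets agreed over Slack.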
A worked example: the document-Q&A engagement
Take a concrete case. A mid-market legal-tech company wants to ship document Q&A against their client’s contract repository. The feature-list scope reads:
- Implement document Q&A
- Implement source citation
- Implement document upload pipeline
- Achieve “production quality”
Three of the four bullets are real engineering. The fourth is the one that matters, and it is the one that gets ignored. “Production quality” is the load-bearing term, and it has no operational definition. Nine months in, the system has 87 percent user satisfaction in usability tests and a 14 percent hallucination rate on the legal-clause subset. The product is not shippable to clients in regulated industries. The agency points at the SOW and says they implemented document Q&A. They are not wrong.
Now write the same engagement with an eval-scoped SOW:
- Pass the legal-clause classifier suite (built from 800 anonymized clauses across 14 contract types, labeled by an in-house attorney) at 0.88 accuracy, with 0.94 accuracy on the safety-critical subset (indemnity, liability, IP assignment)
- Pass the citation-grounding eval at 0.95 (every cited paragraph contains the claim, judged by an attorney-rated rubric)
- Pass the refusal eval at 0.92 on out-of-corpus questions (the system declines rather than hallucinates)
- Pass the retrieval eval at recall@10 of 0.92 against the gold-standard answer paragraphs
- Pass the latency eval at P95 under 6 seconds for end-to-end response on the 95th-percentile contract length
- Pass the cost eval at a per-query unit economics bound of $0.04, with a budget alarm at $50K per month
- Pass the safety eval at zero hallucinated case citations across the 200-question adversarial set
The first scope ships in nine months and is not deployable. The second scope ships in eleven months and is deployable on day one. The two extra months are the actual cost of the project. The first scope hides that cost in a future quarter, where it is paid in user trust and a product manager’s career rather than in cleanly itemized engineering hours. The second scope pays the cost on the way in, in the open, and produces a system that the buyer can rerun the suite against in perpetuity.
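The separate threshold on the safety-critical subset matters because an overall accuracy figure can hide failures concentrated in the clauses that carry the most risk. The sketch below assumes each eval case records an expected and a predicted clause type. The ten cases are invented for illustration and are far smaller than the 800-clause suite described above.

```python
# Clause types treated as safety-critical in the worked example above.
SAFETY_CRITICAL = {"indemnity", "liability", "ip_assignment"}


def accuracy(cases):
    """Fraction of cases where the predicted label matches the expected one."""
    if not cases:
        raise ValueError("no cases to score")
    correct = sum(1 for c in cases if c["predicted"] == c["expected"])
    return correct / len(cases)


def score_clause_suite(cases):
    """Score the whole suite and the safety-critical subset separately."""
    critical = [c for c in cases if c["expected"] in SAFETY_CRITICAL]
    return {"overall": accuracy(cases), "safety_critical": accuracy(critical)}


# Hypothetical results: one miss in ten, and the miss is an indemnity clause.
cases = (
    [{"expected": "indemnity", "predicted": "indemnity"}] * 3
    + [{"expected": "indemnity", "predicted": "termination"}]
    + [{"expected": "termination", "predicted": "termination"}] * 6
)

scores = score_clause_suite(cases)
passes = scores["overall"] >= 0.88 and scores["safety_critical"] >= 0.94
print(scores, "pass" if passes else "fail")
```

Here the overall score clears 0.88 while the safety-critical subset falls well short of 0.94, so the milestone fails. A single blended number would have let it through.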
Where eval-scoping changes the buyer’s behavior
Eval-scoping is not a unilateral move by the agency. It changes what the buyer has to do, and that is part of why feature-list scopes persist; they let buyers avoid the part of their own job that eval-scoping forces them to do.
The buyer has to bring a domain expert to the kickoff: a person who can look at 50 model outputs and tell the agency which are right, wrong, or ambiguous. Without that person, the eval suite cannot be calibrated.
The buyer has to commit to a threshold before the work starts. This is the conversation feature-list scopes are designed to avoid, because it is uncomfortable. It surfaces the difference between what the buyer wants and what the budget supports. That conversation is much cheaper to have on day one than on day 270.
The buyer has to maintain the eval suite after handoff. The eval suite is not a deliverable; it is an asset the buyer operates indefinitely, the same way they operate their test suite for deterministic software. Buyers not prepared to staff for this should not ship AI features at all. The AI agency RFP guide covers the procurement-side mechanics that make this enforceable.
The objection from the agency side and why it is mostly wrong
Agencies that resist eval-scoping usually raise one of three objections. All three are common, and all three are mostly wrong.
“We cannot commit to a number before we have explored the data.” True for the first two weeks. False as a permanent posture. Eval-scoped engagements I have run set provisional thresholds at kickoff, calibrate them in a paid two-week discovery phase, and lock them in by the start of the production build. An agency that refuses to set a threshold even after a discovery phase is telling you they cannot predict their own quality.
“Evals are expensive to build and we should not pay for them out of fixed-price work.” Half-true. Building the initial suite is real engineering work, and it should be a separate, transparent line item, not bundled or hidden. But once it exists, the eval suite is the cheapest insurance the buyer ever buys, and the cost amortizes over every future change to the system. The agency arguing against this line item is usually arguing against the buyer’s ability to verify their work after handoff.
“Our methodology already covers this.” Sometimes true, mostly false. Ask to see the eval suite from a recently shipped engagement. Ask to read five test cases. Ask who set the threshold and why. The questions in the field guide to evaluating an AI agency in 90 minutes work directly here. The agencies that cannot answer those questions are the ones whose methodology does not in fact cover this; their internal practice has no operational handle on what good looks like.
The agencies that lean into eval-scoping are the ones whose engineering discipline already operates this way. They have the suites, they have the thresholds, they have the post-mortems where a failed eval blocked a release. They are the agencies you want, and eval-scoping is the cleanest signal in your procurement process for surfacing them.
What eval-scoping does not solve
Eval-scoping is not a complete answer. It has limits worth naming. It does not solve the problem of evals you cannot afford to write: adversarial safety, long-tail edge cases, and real-user-distribution sampling are all hard and expensive to build well. The discipline must include “what the suite covers and what it does not,” in writing, on the contract.
It does not solve the problem of subjective quality dimensions. For tone, helpfulness, and brand voice, use rubric-based human review, accept that the eval is noisier, and weight it appropriately. Leaving them out of scope means discovering them in user complaints.
It does not replace production observability. A system can pass a 200-case eval and fail catastrophically on the 100,001st real-user query. The eval suite is a gate, not a guarantee. Production telemetry and a feedback loop that turns production failures into new eval cases are the second leg of the discipline. Tooling that closes this loop (Promptfoo, LangSmith, Braintrust, the OpenAI evals API, Anthropic’s eval tooling) should appear in any serious 2026 architecture.
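The feedback loop itself can start very small. The following sketch assumes production failures arrive as records with a `question` field and that the suite is a dictionary keyed by a stable case ID. It does not use the API of any tool named above, and the function names and record shape are hypothetical.

```python
import hashlib


def case_id(question):
    """Stable ID from normalized question text, used to skip duplicates."""
    normalized = " ".join(question.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def add_failures_to_suite(suite, failures):
    """Add production failures to the suite as new, not-yet-labeled eval cases.

    `suite` maps case ID -> case and is updated in place.
    Returns the IDs that were added.
    """
    added = []
    for failure in failures:
        cid = case_id(failure["question"])
        if cid in suite:
            continue  # already covered by an existing case
        suite[cid] = {
            "question": failure["question"],
            "expected": None,  # to be filled in by the buyer's domain expert
            "source": "production",
        }
        added.append(cid)
    return added


# Hypothetical failure reports; the first two differ only in case and spacing.
suite = {}
failures = [
    {"question": "Does clause 14 cap liability?"},
    {"question": "does clause 14  cap liability?"},
    {"question": "Who owns IP created under the SOW?"},
]

added = add_failures_to_suite(suite, failures)
print(len(added), "new cases")
```

New cases enter with no expected answer on purpose. Labeling them is the domain expert's job, and an unlabeled case should not count toward a pass until that is done.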
The procurement question to ask before you sign anything
There is a single question that surfaces eval-scoping discipline faster than any other, and it belongs in every AI agency conversation before contracts are drafted: “What is the eval suite this engagement will be measured against, and what is the passing threshold for each eval?”
If the agency answers with a feature list, you are about to sign a feature-scoped contract regardless of what the SOW heading says. If the agency answers with a list of evals, threshold numbers, and the rationale for each, you are talking to an engineering team that operates the way 2026 AI work has to be operated. The difference shows up not in the marketing materials but in the first 60 seconds of the answer to that one question.
Stop scoping AI projects in features. Scope them in evaluations. The features will get implemented anyway; they are the trivial part. The evals are the part that determines whether the system you ship is the one your users can trust, and that is the only part of the scope that has ever mattered.
Arthur Wandzel is the founder of SFAI Labs, a forward-deployed AI development agency in San Francisco. He has audited more than 40 AI procurement documents in the last 18 months and has yet to find a feature-scoped SOW that did not produce predictable downstream pain.
Frequently Asked Questions
Why do feature-list scopes fail for AI projects?
Feature-list scopes assume binary acceptance: a feature is either implemented or not. AI quality is continuous: a feature can be ‘implemented’ at 60 percent quality and ship under a green status while breaking user trust. Feature-list vocabulary has no contractual handle on the long tail of AI output, no language to distinguish two finished systems that differ by 30 points on faithfulness, and no mechanism to detect quality drift when a provider silently updates the underlying model. The result is predictable: nine months in, the feature is checked off but the product is not deployable, and the buyer has no remedy.
What does eval-driven AI scoping mean in practice?
Eval-driven scoping replaces feature checkboxes with eval thresholds as the unit of acceptance. Instead of ‘implement AI summarization,’ the contract says ‘pass the faithfulness eval at 0.92, pass the latency eval at P95 under 4 seconds, pass the cost eval at $0.018 per query.’ Every line item is falsifiable. The eval suite lives in the buyer’s repository and is rerun by CI on every change. The milestone is paid when the suite passes, not when a feature is demoed.
What should an eval-scoped statement of work contain?
Six parts: the named eval suite (domain accuracy, faithfulness, retrieval, latency, cost, refusal, safety, format), the threshold for each eval with a written rationale, the traffic profile the suite represents, the pass/fail mechanism (CI on every PR, buyer can rerun on demand), the change-control mechanism for adding evals or raising thresholds, and the remedy when the suite fails (the milestone is not paid and the engagement extends until passing). All six in writing, signed, before any code is written.
Can you give a concrete example of feature-scoping versus eval-scoping?
Feature-list scope: ‘implement AI summarization, AI Q&A, AI search.’ Eval-list scope for the same project: pass the domain knowledge test at 0.85 accuracy with 0.90 on safety-critical questions, pass cited-source retrieval at recall@5 of 0.90, pass faithfulness at 0.92, pass latency P95 under 800ms for retrieval and under 4 seconds for generation, pass per-query cost under $0.018, pass refusal eval at 0.95 on out-of-corpus questions. The feature scope tells you nothing about whether the system is good. The eval scope tells you exactly when to ship and exactly when to refuse to.
Who sets the eval thresholds in an eval-scoped engagement?
The buyer’s domain expert sets thresholds in collaboration with the agency, with a written rationale tying each threshold to a business outcome. The buyer must bring an actual domain expert to the kickoff, not a project manager or procurement lead. Thresholds are typically calibrated during a paid two-week discovery phase, then locked in by the start of the production build. An agency that refuses to commit to a threshold even after discovery is signaling they cannot predict their own quality.
How do agencies typically resist eval-scoping and which objections are valid?
Three common objections: ‘we cannot commit before exploring the data’ (true for the first two weeks, false as a permanent posture), ‘evals are expensive to build’ (half-true; the build is real engineering and should be a transparent line item, but once built the suite is the cheapest insurance the buyer ever buys), and ‘our methodology already covers this’ (mostly false; ask to see the eval suite from a recently shipped engagement and read five test cases). Agencies that lean into eval-scoping are the ones whose engineering practice already operates this way.
What does eval-scoping not solve?
Three limits. First, evals you cannot afford to write (adversarial safety, real-user-distribution sampling) are hard and expensive, and a thin suite is gameable. Second, subjective quality dimensions like tone and brand voice need rubric-based human review, which is noisier. Third, evals are gates, not guarantees: a system can pass a 200-case suite and fail on the 100,001st query. Production observability and a feedback loop that turns failures into new eval cases are the second leg of the discipline.
What single question reveals whether an agency operates with eval discipline?
‘What is the eval suite this engagement will be measured against, and what is the passing threshold for each eval?’ If the agency answers with a feature list, you are about to sign a feature-scoped contract regardless of what the SOW heading says. If the agency answers with named evals, threshold numbers, and the rationale for each, you are talking to an engineering team that operates the way 2026 AI work has to be operated. The difference shows up in the first 60 seconds of the answer.
Does the buyer need to maintain the eval suite after the engagement ends?
Yes. The eval suite is not a deliverable; it is an asset the buyer owns and operates indefinitely, the same way they operate their test suite for deterministic software. When the model provider silently updates a model, the eval suite is the only mechanism that detects quality drift before users do. If the buyer is not staffed to maintain the suite (to add new cases when production failures surface them, to update thresholds as expectations rise), they are not staffed to ship AI features at all.