Both sides of an AI agency engagement are currently flying blind, and the demo-RFP-discovery dance does almost nothing to fix that. The buyer cannot tell the eval-disciplined operators from the deck-driven resellers; the agency cannot tell whether the buyer’s data is workable, whether the named champion is empowered, whether the problem is even the problem. So both sides perform a six-week courtship that produces a Statement of Work nobody fully believes in, and then they discover, in week eight of the main contract, what they should have learned in week one. The fix is not a better deck or a longer reference call. The fix is a paid pilot, run before the main contract is signed: one to two weeks, one eval-bound deliverable, one named senior engineer, a fixed budget between five and fifteen thousand dollars, and a kill clause on both sides.
The argument here is not that pilots are nice to have. It is that the main-contract-first model is structurally broken in 2026 and that the paid pilot is the artifact that replaces three older artifacts at once: the demo, the RFP, and the discovery phase. Each of those was load-bearing in 2018 and decorative by 2024; running them in 2026 is a coordination tax that produces no information either side can act on. A two-week paid pilot produces more decision-relevant evidence than three months of procurement theater, and it produces it for both sides simultaneously. That symmetry (both sides learning, both sides paying, both sides able to walk) is what makes it work. For the broader thesis on what an AI dev partner should be in 2026, see the AI agency manifesto.
Decision Scope
This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.
Why the demo-RFP-discovery dance fails in 2026
A traditional procurement loop has three observable artifacts: the sales demo, the RFP response, and the discovery phase that opens the engagement. In 2018 each of those carried real information. The demo proved the agency had a product or capability; the RFP forced apples-to-apples comparison across vendors; discovery surfaced enough about the data and the org to make scoping defensible. None of that is true now.
Demos in 2026 are slop. Any agency with a Cursor subscription and forty hours of uninterrupted attention can ship a demo that looks production-grade against curated inputs. The signal in a demo collapsed somewhere around late 2024, and what’s left is a confidence sorting hat that selects for narrative skill rather than engineering discipline. RFPs are worse: they reward the firms with full-time proposal writers and penalize the boutiques with full-time engineers. The questions are written by procurement to be answerable, not informative; the answers are written by sales to be reassuring, not disprovable. By the time the RFP is scored on a weighted matrix, the matrix is measuring proposal hygiene rather than delivery capacity.
Discovery is the cruelest of the three because it bills time against an artifact that nobody uses. Four weeks of stakeholder interviews, journey maps, and a written discovery deck produce a Statement of Work, and the SOW is then immediately invalidated by the first contact with real data in week five. The data has the wrong shape, the workflow has an undocumented exception, the named user has retired, the model the SOW assumed is now deprecated. The team rewrites the plan in production while running on a contract that no longer reflects reality. For a sharper alternative to the four-week scoping dance, see the AI agency discovery week: a 5-day method that replaces 4-week scoping.
The deeper failure is informational asymmetry. The buyer knows the org chart, the politics, and the data’s lived weirdness. The agency knows the failure modes, the eval techniques, and the architecture decisions that survive contact with production. Neither side can credibly transmit their knowledge during procurement, and neither has the standing to demand the kind of access that would close the gap. The paid pilot is the structural answer: by paying real money for a small piece of real work, both sides earn the standing to learn the things the other knows.
The paid pilot brief
The pilot is not a watered-down version of the main engagement. It is a structurally different artifact with five non-negotiable components.
One eval-bound deliverable. Pick one specific problem with one specific output (an extractor, a classifier, a routing decision, a draft, a summary, an evaluator) and write the eval before the pilot begins. The eval is a set of twenty to fifty ground-truth examples drawn from real production cases, with explicit pass/fail criteria, a numeric threshold tied to a business outcome, and a current-baseline number. The pilot’s success condition is whether the agency moves the number from baseline to threshold inside the time-box. There is no separate discussion of “scope” because the eval is the scope.
A named senior engineer. Not a tech lead at fifteen percent allocation; not a roster of “available engineers” across the agency. One named human, on the kickoff call, full-time-or-meaningfully-close, accountable for the eval delta. The buyer should be able to look up that engineer’s GitHub, their last shipped PR, their writing on the tools they will use. If the agency cannot or will not name someone before money changes hands, the pilot is already failing; the agency is selling staffing optionality, which is exactly the agency model that pilots are designed to replace.
A fixed budget between five and fifteen thousand dollars. Five thousand for a one-week pilot with a tightly scoped extraction or classification problem; ten thousand for a two-week pilot that requires real architecture decisions; fifteen thousand if the data requires meaningful preparation and there is a custom eval-harness build inside the time-box. Fixed-price, paid in full at the start, no hourly true-ups. The fixed price disciplines both sides: the agency cannot pad, the buyer cannot scope-creep without a renegotiation, and the eval threshold is the single thing that determines whether the work is delivered.
A one-or-two-week time-box. Two weeks is the upper bound for almost any problem. One week is enough for problems with clean data and a well-known pattern. The time-box is hard. If the eval has not moved by end-of-week-two, the pilot ends; the agency does not get a third week to “almost get there.” Hard time-boxes are the only way to surface architectural mistakes early, and an architectural mistake surfaced in week two of a pilot is recoverable; the same mistake surfaced in week eight of a main contract is a write-off.
A kill clause on both sides. Either side can walk at the end of the time-box, with no penalty and no obligation to proceed to the main contract. The kill clause is what makes the pilot informationally honest: the agency knows the buyer is genuinely shopping; the buyer knows the agency is genuinely auditioning; both sides know that the eval number, not the relationship vibe, is what determines next steps. A pilot without a real kill clause is a paid sales meeting, not a pilot.
The artifact at the end of the pilot is the merged code, the eval-delta number, and a one-page memo on the data shape, the architecture choices made, the failure modes catalogued, and the recommended path for the main contract, including, sometimes, a recommendation against proceeding. For a deeper treatment of the eval-bound prototype shape, see the AI POC development 6-week sprint guide, which extends the pilot into a longer-form prototype when the eval delta justifies it.
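As a rough illustration of the eval-bound deliverable described above, the sketch below scores a system against ground-truth cases and reports whether the number moved from baseline to threshold. It is a minimal sketch under stated assumptions, not a prescribed harness: the names `EvalCase`, `EvalResult`, and `run_eval` are hypothetical, and exact string match stands in for whatever explicit pass/fail criteria the buyer and agency actually agree on.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    """One ground-truth example drawn from a real production case."""
    input_text: str
    expected: str


@dataclass(frozen=True)
class EvalResult:
    score: float      # fraction of cases passed, 0.0 to 1.0
    baseline: float   # score measured before the pilot starts
    threshold: float  # score the pilot must reach inside the time-box

    @property
    def delta(self) -> float:
        """The eval delta: how far the score moved from the baseline."""
        return self.score - self.baseline

    @property
    def passed(self) -> bool:
        """True when the score has reached the agreed threshold."""
        return self.score >= self.threshold


def run_eval(
    system: Callable[[str], str],
    cases: Sequence[EvalCase],
    baseline: float,
    threshold: float,
) -> EvalResult:
    """Score `system` on the cases. Exact match is an assumed pass/fail rule."""
    if not 20 <= len(cases) <= 50:
        raise ValueError("the brief calls for twenty to fifty ground-truth examples")
    hits = sum(1 for case in cases if system(case.input_text).strip() == case.expected)
    return EvalResult(score=hits / len(cases), baseline=baseline, threshold=threshold)


if __name__ == "__main__":
    # Toy data for illustration only; real cases come from production.
    toy_cases = [
        EvalCase(f"ticket {i}", "billing" if i % 2 == 0 else "support")
        for i in range(20)
    ]
    result = run_eval(lambda text: "billing", toy_cases, baseline=0.4, threshold=0.9)
    print(result.score, result.delta, result.passed)
```

Keeping the baseline and threshold inside the result object means the closing memo can report the delta and the pass/fail outcome from a single run, rather than from numbers reconstructed after the fact.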
What both sides learn
A pilot that is structured this way gives both sides information that no other procurement instrument can produce.
The agency learns the data shape. Real production data has a thousand small pathologies: schema drift, undocumented enum values, PII boundaries that nobody mapped, freshness windows that are wrong in the docs, encoding inconsistencies that only show up under load. The agency cannot scope a main contract honestly until they have spent ten engineering hours inside that data, and ten engineering hours inside the data is exactly what a paid pilot purchases. The agency also learns the political shape: who answers Slack messages, who blocks PRs, who has the authority to approve an eval threshold change. Those are the variables that make or break a six-month engagement, and they are invisible from the outside.
The buyer learns the agency’s eval discipline. Within the first three days of a pilot, the buyer sees whether the agency writes evals before code or after, whether the evals are committed to the buyer’s repo or held captive in a vendor environment, whether failure modes are catalogued in a real document or hand-waved away in standup. The buyer sees the agency’s debugging style, their reaction to a problem they did not anticipate, their honesty when the eval number does not move. None of that is visible in a demo. All of it is visible in a pilot, and all of it is more predictive of main-contract success than any reference call. For a frame on what the modern alternative to procurement looks like, see the AI agency RFP is broken: here is what replaces it.
Both sides learn the architecture-decision rate. AI engagements live or die by how fast the team can make and revise architectural decisions: model choice, retrieval shape, eval cadence, the routing layer, the fallback strategy. A pilot forces five or six of those decisions into a two-week window and lets both sides see whether the agency reasons about them with rigor or with vibes. Buyers who have run two pilots can rank agencies by decision-quality with high confidence; buyers who have only seen demos cannot rank them at all.
What the pilot replaces
The structural argument is that the paid pilot replaces three older artifacts simultaneously, and that each replacement is a strict upgrade.
It replaces the sales demo because a real eval-delta on real data is dispositive in a way no demo can be. It replaces the RFP because the pilot answers the only RFP question that matters (can this team move this number on this data, with this engineer?) directly, in code, instead of by proxy through procurement narrative. It replaces the four-week discovery phase because the pilot is the discovery: by end of week two, both sides know more about the data, the problem, and each other than any discovery deck has ever produced.
The pilot also replaces a category of failure mode that nobody likes to name. About thirty percent of main contracts that begin without a pilot end in a quiet impasse around month four, where the agency has billed faithfully and the buyer has nothing they can deploy. Both parties usually blame the other; the truer answer is that the engagement was scoped on assumptions that pilot-grade evidence would have invalidated. A two-week, fifteen-thousand-dollar pilot prevents most of those failures, which makes it one of the highest-ROI procurement instruments available to either side. The buyer who refuses to pay for a pilot because “we don’t pay for sales” is, in expected value, paying twenty times that amount in failed-engagement cost. The agency that refuses to run a pilot because “we don’t do free trials” is conflating a free trial with a paid one and signaling that they are not confident their work would survive the test.
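One way to read the expected-value claim above, using only the paragraph's own illustrative figures, is to work backwards to the loss per failed engagement that the claim implies. The sketch below does that arithmetic. It assumes the "twenty times" figure is an expected cost (probability of failure times loss per failure); the function name and parameter names are hypothetical.

```python
def implied_loss_per_failure(
    pilot_cost_usd: int = 15_000,      # the two-week pilot price quoted above
    failure_rate_percent: int = 30,    # share of no-pilot contracts that stall
    expected_cost_multiple: int = 20,  # expected failure cost, in pilot prices
) -> tuple[int, float]:
    """Return (expected failure cost, implied loss per failed engagement).

    Assumes expected cost = failure probability * loss per failure,
    so loss per failure = expected cost / failure probability.
    """
    if not 0 < failure_rate_percent <= 100:
        raise ValueError("failure_rate_percent must be in (0, 100]")
    expected_cost = expected_cost_multiple * pilot_cost_usd
    loss_per_failure = expected_cost * 100 / failure_rate_percent
    return expected_cost, loss_per_failure


if __name__ == "__main__":
    expected_cost, loss = implied_loss_per_failure()
    print(f"expected failure cost: ${expected_cost:,}")
    print(f"implied loss per failed engagement: ${loss:,.0f}")
```

Under this reading, the claim holds only when a failed engagement costs the buyer on the order of a million dollars. Buyers with smaller contracts should rerun the arithmetic with their own contract size and their own estimate of the failure rate.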
A pilot is not a perfect signal. Some agencies are great at pilots and bad at long engagements; some buyers run pilots cynically as a way to extract free architecture without intending to sign. Those failure modes exist. They are still rarer than the failure modes of the demo-RFP-discovery loop, and they are easier to detect: an agency with a long history of pilots that did not convert is a signal in itself, and a buyer with a pattern of running and dropping pilots is identifiable inside a small market within two cycles. The paid pilot is not an oracle; it is just a much better instrument than what it replaces.
The simplest version of the rule
Run a paid pilot before signing the main contract. One eval-bound deliverable, one named senior engineer, fixed budget five to fifteen thousand dollars, one-or-two-week time-box, kill clause on both sides. If the agency refuses, find another agency. If the buyer refuses, find another buyer. The pilot is the cheapest, fastest, most informationally honest instrument either side will ever run, and once you have run two of them you will not run a procurement loop any other way.
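The five components can also be written down as a checklist that either side runs against a draft brief before money changes hands. The following is a minimal sketch, assuming a hypothetical `PilotBrief` record and `brief_problems` helper; the field names are illustrative, and the numeric bounds are the ones stated in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotBrief:
    deliverable: str                 # the single eval-bound output
    eval_written_before_start: bool  # eval exists before work begins
    named_engineer: str              # one named senior engineer
    budget_usd: int                  # fixed price, paid in full at the start
    weeks: int                       # hard time-box
    buyer_can_walk: bool             # kill clause, buyer side
    agency_can_walk: bool            # kill clause, agency side


def brief_problems(brief: PilotBrief) -> list[str]:
    """List the ways a brief departs from the five components; empty means none."""
    problems = []
    if not brief.deliverable.strip() or not brief.eval_written_before_start:
        problems.append("needs one deliverable with its eval written before the pilot begins")
    if not brief.named_engineer.strip():
        problems.append("needs one named senior engineer")
    if not 5_000 <= brief.budget_usd <= 15_000:
        problems.append("budget should be fixed between 5,000 and 15,000 dollars")
    if brief.weeks not in (1, 2):
        problems.append("time-box should be one or two weeks")
    if not (brief.buyer_can_walk and brief.agency_can_walk):
        problems.append("kill clause must cover both sides")
    return problems


if __name__ == "__main__":
    # Hypothetical draft brief, for illustration only.
    draft = PilotBrief(
        deliverable="invoice field extractor",
        eval_written_before_start=True,
        named_engineer="",
        budget_usd=25_000,
        weeks=3,
        buyer_can_walk=True,
        agency_can_walk=True,
    )
    for problem in brief_problems(draft):
        print("-", problem)
```

A check like this does not judge whether the eval itself is a good one; it only catches briefs that have drifted away from the structure before the pilot starts.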
Arthur Wandzel is the founder of SFAI Labs, a forward-deployed AI development agency in San Francisco. He has run more than thirty paid pilots across portfolio companies and clients in the last 24 months, and has declined four main contracts on the basis of pilot findings.
Frequently Asked Questions
What is a paid pilot in an AI agency engagement?
A paid pilot is a one-to-two-week, fixed-price engagement run before the main contract is signed. It has five non-negotiable components: one eval-bound deliverable defined before the work begins, one named senior engineer accountable for the eval delta, a fixed budget between five and fifteen thousand dollars, a hard one-or-two-week time-box, and a kill clause that lets either side walk at the end with no obligation to proceed. The artifact at the end is merged code, a measured eval delta, and a one-page memo covering the data shape, architecture choices, failure modes, and a recommendation on the main contract.
How much should a paid AI pilot cost?
Five thousand dollars for a one-week pilot on a tightly scoped extraction or classification problem; ten thousand for a two-week pilot that requires real architecture decisions; fifteen thousand if the data needs meaningful preparation or a custom eval-harness build inside the time-box. The price is fixed, paid in full at the start, with no hourly true-ups. The fixed price is the discipline: the agency cannot pad, the buyer cannot scope-creep without a renegotiation, and the eval threshold is the single thing that determines whether the work is delivered.
Why does the paid pilot replace the sales demo, the RFP, and the discovery phase?
Demos in 2026 select for narrative skill rather than engineering discipline because any agency with a Cursor subscription can ship a curated demo. RFPs reward firms with full-time proposal writers and produce reassuring rather than disprovable answers. A four-week discovery phase produces a Statement of Work that gets invalidated by the first contact with real data. A paid pilot answers the one question that matters (can this team move this number on this data with this engineer?) directly, in code, in two weeks, instead of indirectly through three months of procurement theater.
What does the agency learn during a paid AI pilot?
The agency learns the data shape: the schema drift, the undocumented enum values, the PII boundaries that nobody mapped, the freshness windows that are wrong in the docs, the encoding inconsistencies that only show up under load. They also learn the political shape: who answers Slack messages, who blocks pull requests, who has the authority to approve an eval threshold change. Those are the variables that make or break a six-month engagement, and they are invisible from outside the codebase. Ten engineering hours inside the data is what a pilot purchases.
What does the buyer learn during a paid AI pilot?
The buyer learns the agency’s eval discipline. Within three days the buyer sees whether the agency writes evals before code or after, whether the evals are committed to the buyer’s repo or held captive in a vendor environment, whether failure modes are catalogued in real documents or hand-waved away in standup. The buyer also sees the agency’s debugging style, their reaction to a problem they did not anticipate, and their honesty when the eval number does not move. None of that is visible in a demo or a reference call, and all of it is more predictive of main-contract success.
Why is a kill clause required on both sides of a paid pilot?
The kill clause is what makes the pilot informationally honest. It allows either side to walk at the end of the time-box with no penalty and no obligation to proceed to the main contract. The agency knows the buyer is genuinely shopping; the buyer knows the agency is genuinely auditioning; both sides know that the eval number, not the relationship vibe, determines next steps. A pilot without a real kill clause is a paid sales meeting rather than a pilot, and it forfeits most of the informational value the structure was designed to produce.
What if the agency refuses to name a specific senior engineer?
If the agency cannot or will not name a specific senior engineer before money changes hands, the pilot is already failing and the buyer should walk. The agency is selling staffing optionality (the ability to swap engineers in and out based on internal capacity), which is exactly the agency model that paid pilots are designed to replace. The buyer should be able to look up the named engineer’s GitHub history, their last shipped pull requests, and their writing on the tools they will use during the pilot. Anonymous staffing is a 2024 service shape, not a 2026 one.
What if the eval number does not move during the pilot?
If the eval has not moved by the end of the time-box, the pilot ends. The agency does not get a third week to almost get there. Hard time-boxes are the only way to surface architectural mistakes early; an architectural mistake surfaced in week two of a pilot is recoverable, but the same mistake surfaced in week eight of a main contract is a write-off. The pilot’s one-page closing memo explains why the number did not move (usually an architectural assumption rather than a prompt detail) and recommends either a different agency, a different problem framing, or a different time-box for a second pilot.
What objections do buyers and agencies raise to paid pilots, and are they valid?
Buyers sometimes refuse to pay for what they see as sales work. In expected value that refusal is expensive: roughly thirty percent of main contracts started without a pilot end in a quiet impasse around month four, and the expected cost of that failure is about twenty times the price of a two-week pilot. Agencies sometimes refuse because they conflate a free trial with a paid one and frame pilots as a discount. The conflation is wrong: a paid pilot is full-rate work on a small problem, and the refusal usually signals that the agency is not confident their work would survive an eval-bound test. Both objections collapse on inspection.