
The AI agency discovery week: a 5-day method that replaces 4-week scoping

A 5-day discovery week is not a sales artifact; it is a forcing function. Four-week AI scoping is a 2023 ritual that produces a deck, a SOW, and a JIRA backlog that nobody reads twice. Five-day discovery produces a problem narrative, a working eval rubric, a runnable prototype against sample data, an ADR, and an eval-bound proposal, all delivered to the buyer by Friday. Scoping done in decks is unfalsifiable; scoping done in artifacts is the same work performed honestly. This is the day-by-day method, the named outputs from each side, and the contrast with the broken alternative most agencies still sell.

Each day produces an artifact the next day consumes. By Friday afternoon, every assumption that would otherwise be buried in a four-week SOW has been written, tested against real data, and either confirmed or killed. This is the same discipline that the forward-deployed AI dev partner manifesto describes for the full engagement, applied to the pre-engagement window. The cadence below has replaced the four-week scoping cycle entirely at SFAI Labs and several peer firms across more than 30 kickoffs in the last 12 months.

Decision Scope

This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.

Monday: problem narrative and stakeholder map

The week begins where most scoping engagements end: by writing down the problem in prose, not bullet points, and naming every human in the room. The Monday output is two artifacts, problem-narrative.md and stakeholder-map.md, both committed to a shared discovery repository before the day ends.

Morning, 90 minutes: the working session. The agency’s tech lead, a single senior engineer, and the buyer’s product owner plus domain expert convene for a structured working session. No slide deck. The output is a 600-to-900-word problem narrative that names the user, the workflow that exists today, the failure mode that triggered the project, the constraints that bound the solution, and the criterion under which the project is unambiguously a success. It is written live in a shared editor and committed at the end. If the agency cannot land a tight narrative in 90 minutes, the discovery week has surfaced its first finding: the buyer does not yet know what problem they are paying to solve.

Afternoon, 120 minutes: the buyer-side stakeholder map. The agency does not draw the buyer’s org chart; the buyer does. The map is a single page naming the economic buyer, executive sponsor, project owner, domain expert, engineering reviewer, security/compliance reviewer, data steward, and any downstream consumer of the system’s output. Each stakeholder has a role, a decision authority, and a calendar commitment for the next two weeks. This determines whether the proposal on Friday is approvable in the room it is presented in.

Agency commits to: the narrative draft, facilitating the session, and committing both artifacts before end of day. Buyer commits to: sending the right humans (engineers and domain experts, not just account contacts) and naming the decision-maker who will sign Friday’s proposal. Running discovery without a named decision-maker is how scoping turns into eight-week zombies.

Monday’s output is the foundation for everything that follows. Tuesday’s data audit is filtered against the narrative, Wednesday’s eval rubric is scored against the success criterion, Thursday’s ADR is justified against the constraints, and Friday’s proposal lives or dies by whether it satisfies the criterion.

Tuesday: data audit and failure-mode catalog from sample data

Tuesday is when the discovery week stops being a conversation and starts being an investigation. The agency’s senior engineer, working with the buyer’s data steward, conducts a structured walkthrough of every data source the system will touch and produces two named artifacts: data-audit.md and failure-modes.md. Both are anchored to a representative sample of real data, not described in the abstract and not pulled from a template.

Morning: the data audit. Each source is characterized on five dimensions: source of truth, schema and freshness, volume and access pattern, PII boundary and retention, and observed quality on a sample. The format is brutally specific. “Stripe webhooks land in BigQuery within 90 seconds, schema versioned in dbt, 12 months retention, contains email plus last-4 of card, sample of 5,000 events shows 0.3 percent malformed payloads” beats “Stripe data is available.” Five sources characterized to this depth beat 25 sources listed as bullet points, and writing the audit narrows the proposal scope by forcing every data dependency to be named.
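The five audit dimensions can be captured as a small record so that an unfilled dimension is visible rather than silently skipped. This is a minimal sketch, not a prescribed format: the `SourceAudit` class, its field names, and `missing_dimensions` are hypothetical, and the malformed count of 15 is simply 0.3 percent of the 5,000-event sample quoted above.

```python
from dataclasses import dataclass
from typing import Optional

# The five dimensions named in the audit description (field names are illustrative).
DIMENSIONS = (
    "source_of_truth",
    "schema_and_freshness",
    "volume_and_access",
    "pii_and_retention",
    "observed_quality",
)


@dataclass
class SourceAudit:
    """One audited data source; a dimension left as None has not been characterized."""
    name: str
    source_of_truth: Optional[str] = None
    schema_and_freshness: Optional[str] = None
    volume_and_access: Optional[str] = None
    pii_and_retention: Optional[str] = None
    observed_quality: Optional[str] = None


def missing_dimensions(audit: SourceAudit) -> list[str]:
    """Return the dimensions that are still empty for this source."""
    return [d for d in DIMENSIONS if not getattr(audit, d)]


def malformed_rate(malformed: int, sample_size: int) -> float:
    """Share of sampled records that failed to parse."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return malformed / sample_size


# The Stripe example from the text: 15 malformed payloads out of 5,000 events.
rate = malformed_rate(15, 5000)
stripe = SourceAudit(
    name="stripe_webhooks",
    source_of_truth="Stripe webhooks landing in BigQuery",
    schema_and_freshness="schema versioned in dbt; lands within 90 seconds",
    pii_and_retention="email plus last-4 of card; 12 months retention",
    observed_quality=f"{rate:.1%} malformed payloads in a 5,000-event sample",
)

print(missing_dimensions(stripe))
```

Even the “brutally specific” Stripe sentence leaves volume and access pattern unstated, so the check reports `volume_and_access` as still open. That is the kind of gap the audit is meant to force into view.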

Afternoon: the failure-mode catalog from sample data. The senior engineer takes 50 to 200 representative inputs from the buyer’s actual production data, anonymized as needed, and runs them through a stub or baseline. Each failure is tagged: hallucination, retrieval drift, schema mismatch, latency, prompt injection, silent model drift, missing context, downstream breakage. The output is a table of the top 10 to 15 failure categories ranked by frequency and severity. Failure modes you find on sample data Tuesday are the same modes that will hit production in month three; finding them now lets the proposal price them in.
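One way to turn tagged failures into the ranked table is sketched below. The tag names come from the list above, but the ranking rule is an assumption: the text says “frequency and severity” without defining how to combine them, so this sketch multiplies the count by a hypothetical 1-to-3 severity weight. The function name, input IDs, and weights are all illustrative.

```python
from collections import Counter

# Tags taken from the failure-mode list in the text.
FAILURE_TAGS = {
    "hallucination", "retrieval_drift", "schema_mismatch", "latency",
    "prompt_injection", "silent_model_drift", "missing_context",
    "downstream_breakage",
}


def rank_failures(
    observations: list[tuple[str, str]],
    severity: dict[str, int],
    top_n: int = 15,
) -> list[tuple[str, int, int, int]]:
    """Rank failure tags by count * severity (an assumed scoring rule).

    observations: (input_id, tag) pairs from the sample run.
    Returns rows of (tag, count, severity, score), highest score first.
    """
    unknown = {tag for _, tag in observations} - FAILURE_TAGS
    if unknown:
        raise ValueError(f"unknown failure tags: {sorted(unknown)}")
    counts = Counter(tag for _, tag in observations)
    rows = [
        (tag, n, severity.get(tag, 1), n * severity.get(tag, 1))
        for tag, n in counts.items()
    ]
    # Sort by score descending, then by tag name so ties are deterministic.
    rows.sort(key=lambda row: (-row[3], row[0]))
    return rows[:top_n]


# Hypothetical observations from a four-input sample.
observations = [
    ("in-001", "hallucination"),
    ("in-002", "schema_mismatch"),
    ("in-003", "hallucination"),
    ("in-004", "prompt_injection"),
]
severity = {"hallucination": 2, "schema_mismatch": 1, "prompt_injection": 3}

ranked = rank_failures(observations, severity)
for tag, count, sev, score in ranked:
    print(f"{tag:20} count={count} severity={sev} score={score}")
```

A single prompt-injection hit outranks a single schema mismatch here because of its higher weight, which is the effect a severity term is meant to have. The weights themselves would need to be agreed with the buyer.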

Agency commits to: running the sample through working code (not slideware) and tagging failures honestly. Buyer commits to: sample data access by 9 a.m. Tuesday, large enough to surface failure modes and scoped to a single use case to keep the audit tractable. An AI scoping engagement that runs without sample data is the canonical pattern that produces a four-week deck and a project that fails in production.

By Tuesday end of day the engagement has converted abstract risk into observed failure modes. The four-week scoping pattern rarely reaches this inflection point because it does not look at data until week 5 or 6, after the scoping cycle has already ended.

Wednesday: eval rubric draft and the first prototype against sample data

Wednesday is when discovery becomes provable. By the end of the day there is a written eval rubric and a runnable prototype, both committed to the discovery repository, and a number (the prototype’s score against the rubric) that the proposal on Friday will use as its starting point.

Morning: the eval rubric draft. Three parts. The test set is 30 to 60 ground-truth examples drawn from Tuesday’s sample data, each with an expected output or explicit acceptance criterion. The scoring criteria define pass and fail, with sub-criteria for partial credit. The thresholds set the numeric bar for each criterion: typically a primary threshold tied to Monday’s success criterion, plus secondary thresholds for safety, latency, and cost. The rubric is committed as evals/rubric-draft.md plus a runnable evals/run.py. A 60-example rubric is not the full eval suite, but it is enough to ground Friday’s proposal in measurement rather than promise.
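To make the three parts concrete, here is a rough sketch of what a file like evals/run.py might contain. It is an illustration under stated assumptions, not the harness the article describes: `Example`, `score_example`, `run_rubric`, the partial-credit rule (0.5 when the expected value appears inside a longer output), the stub system, and the 0.9 primary threshold are all hypothetical. Secondary thresholds for safety, latency, and cost are left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Example:
    """One ground-truth test case: an input and its expected output."""
    input: str
    expected: str


def score_example(output: str, expected: str) -> float:
    """Assumed scoring rule: 1.0 exact match, 0.5 partial credit, else 0.0."""
    out, exp = output.strip().lower(), expected.strip().lower()
    if out == exp:
        return 1.0
    if exp and exp in out:
        return 0.5  # right answer buried in extra text
    return 0.0


def run_rubric(
    system: Callable[[str], str],
    test_set: list[Example],
    primary_threshold: float,
) -> dict:
    """Score the system on every example and compare the mean to the threshold."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    scores = [score_example(system(ex.input), ex.expected) for ex in test_set]
    mean = sum(scores) / len(scores)
    return {
        "mean_score": mean,
        "threshold": primary_threshold,
        "passed": mean >= primary_threshold,
        "gap": max(0.0, primary_threshold - mean),
    }


def stub_system(text: str) -> str:
    """Stand-in for the prototype: deliberately crude, as a first baseline would be."""
    return "refunded" if text.endswith("1") else "status: pending"


test_set = [
    Example("refund status for order 1", "refunded"),
    Example("refund status for order 2", "pending"),
    Example("refund status for order 3", "pending"),
    Example("refund status for order 4", "denied"),
]

result = run_rubric(stub_system, test_set, primary_threshold=0.9)
print(result)
```

The stub scores one exact match, two partial credits, and one miss, so the mean sits well under the threshold. The `gap` field is the distance a proposal would have to close.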

Afternoon: the first prototype. Working from the narrative, the data audit, and the failure-mode catalog, the senior engineer ships a runnable prototype that produces an output for each test-set example. Ugly is fine. Hardcoded values are fine. Mocked retrieval is fine. What is not fine is a prototype that does not run, or one that runs but is not wired to the rubric. Two artifacts ship: the prototype on a feature branch, and evals/baseline-results.md with the score against each criterion. Typical Wednesday-afternoon scores land in the 0.4-to-0.6 range, well below the threshold the buyer cares about. That is the correct shape; Friday’s proposal is the path from that number to the threshold.

Agency commits to: drafting the rubric, building the prototype, and committing the result honestly even when ugly. Buyer commits to: an eval-rubric review Thursday morning where the domain expert validates the test set, scoring criteria, and thresholds. If the domain expert cannot review, the proposal cannot be eval-bound, and the engagement collapses back to opinion-trading.

The rubric is the single most important artifact of the week. It converts the proposal from a wishful deliverables list into an instrument the buyer can use to verify whether the engagement is on track every week. The AI consulting discovery phase guide covers the rubric patterns we use across domains.

Thursday: architecture decision record and senior engineer match

Thursday converts discovery into a designed system. The output is an ADR, adr/0001-system-architecture.md, that names every architectural choice with a two-sentence justification, plus the named senior engineer who will own the build if the engagement closes. By Thursday night the buyer knows what they are buying, who is building it, and what the build will look like.

Morning: the eval-rubric review. A 60-minute session where the buyer’s domain expert and engineering reviewer walk through Wednesday’s rubric. They challenge test cases, tighten scoring criteria, and confirm or adjust thresholds. The version that survives becomes evals/rubric.md and is what the proposal is bound to.

Afternoon: the ADR. A single ADR covers eight choices: model selection, the routing/abstraction layer (LiteLLM, framework-native, or custom), retrieval strategy (chunking, freshness, reranking, or none), the tool-call boundary, caching, fallback when the primary provider is degraded, the observability stack, and the cost ceiling per request and per month. Each choice is named and justified in two sentences, grounded in an artifact from the previous days: failure modes and sample data shape from Tuesday, threshold gaps from Wednesday’s baseline. The ADR is reviewed in a 60-minute architecture session with the buyer’s engineering reviewer; surviving choices are committed, the rest flagged for a follow-up ADR before the proposal is finalized.
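A short pre-review check can confirm that the ADR draft names all eight choices and that each one points back to an artifact. The sketch below is one hypothetical way to do that. The `Decision` record, the key names, and `review_adr` are illustrative, and the sample justification is invented for the example. Only the artifact file names come from earlier in the week.

```python
from dataclasses import dataclass

# The eight choices the ADR is expected to cover (key names are illustrative).
REQUIRED_CHOICES = (
    "model_selection",
    "routing_layer",
    "retrieval_strategy",
    "tool_call_boundary",
    "caching",
    "provider_fallback",
    "observability_stack",
    "cost_ceiling",
)


@dataclass(frozen=True)
class Decision:
    choice: str         # what was chosen
    justification: str  # the two-sentence rationale
    grounded_in: str    # artifact the rationale cites, e.g. "failure-modes.md"


def review_adr(decisions: dict[str, Decision]) -> dict[str, list[str]]:
    """List required choices that are absent and decisions that cite no artifact."""
    missing = [key for key in REQUIRED_CHOICES if key not in decisions]
    ungrounded = [
        key for key, decision in decisions.items()
        if not decision.grounded_in.strip()
    ]
    return {"missing": missing, "ungrounded": ungrounded}


# A deliberately incomplete draft with invented content.
draft = {
    "routing_layer": Decision(
        choice="LiteLLM",
        justification=(
            "Provider outages appeared in the sample run. "
            "A routing layer keeps the fallback path in one place."
        ),
        grounded_in="failure-modes.md",
    ),
    "caching": Decision(
        choice="none for now",
        justification="Not yet measured. Revisit after the baseline.",
        grounded_in="",
    ),
}

review = review_adr(draft)
print("missing:", review["missing"])
print("ungrounded:", review["ungrounded"])
```

Anything reported as missing or ungrounded is a candidate for the follow-up ADR rather than a commitment in the proposal.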

Late afternoon: the senior engineer match. The agency names the senior engineer who will own the build if the engagement closes, and that engineer joins a 30-minute introduction with the buyer’s engineering reviewer. This is not a sales gesture; it is the answer to “who is the human accountable for this system in production.” Agencies that staff discovery with engineers different from the build team are running a sales-to-delivery handoff that produces predictable failure in week 3. The discovery week’s senior engineer is the build engineer; if not, the buyer knows on Thursday rather than in month two.

Agency commits to: writing the ADR, defending each choice, and naming the build engineer. Buyer commits to: an engineering reviewer who can challenge architectural choices on technical grounds, and an honest read on whether the senior engineer is a fit.

By Thursday end of day, the week has produced a narrative, a data audit, a failure-mode catalog from real sample data, an eval rubric, a baseline number, an ADR, and a named senior engineer. Friday’s proposal assembles these into a contract.

Friday: the eval-bound proposal with kill clause

Friday is the proposal. Not a deck, not a SOW with hand-wavy language about “scope” and “deliverables.” A document of three to five pages that is bound to the rubric, priced against a defined scope, and equipped with a kill clause that lets either side exit cleanly if the engagement is not delivering against the rubric.

Morning: proposal drafting. The agency assembles the week’s artifacts. The eval-bound scope names the rubric thresholds the engagement is committing to hit, the timeline (typically 6 to 12 weeks for the first threshold band), the artifacts shipped along the way, and the staffing. Pricing is fixed-fee or capped-T&M against the eval-bound scope, with an explicit cost ceiling per request and per month from Thursday’s ADR. The kill clause names the eval delta required by a defined checkpoint (typically week 4) and the exit terms if not met, usually a paid-to-date settlement and IP transfer of the artifacts. The proposal references the first 14-day engagement shape as the operating cadence that follows.
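The kill-clause trigger reduces to a small comparison: has the score moved by the agreed delta once the checkpoint arrives? The following is an illustrative sketch only. `KillClause` and `exit_available` are hypothetical names, the baseline of 0.5 and delta of 0.2 are made-up numbers, and real contract terms would be set by the parties and their counsel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KillClause:
    """Eval-delta exit trigger bound to the agreed rubric."""
    baseline_score: float     # score recorded during discovery week
    required_delta: float     # improvement the engagement commits to
    checkpoint_week: int = 4  # "typically week 4" per the proposal shape

    def exit_available(self, week: int, current_score: float) -> bool:
        """True if the checkpoint has arrived and the delta has not been met."""
        if week < self.checkpoint_week:
            return False  # the clause cannot trigger before the checkpoint
        # Round to avoid float noise (0.7 - 0.5 is not exactly 0.2 in binary).
        achieved = round(current_score - self.baseline_score, 9)
        return achieved < self.required_delta


clause = KillClause(baseline_score=0.5, required_delta=0.2)

print(clause.exit_available(week=3, current_score=0.5))
print(clause.exit_available(week=4, current_score=0.6))
print(clause.exit_available(week=4, current_score=0.7))
```

Because the check reads only the rubric score, the week number, and the agreed delta, there is little room to argue about whether the clause has been triggered.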

Afternoon: the proposal presentation. A 60-minute session with the economic buyer, engineering reviewer, and project owner. The agency walks through each artifact in order (narrative, data audit, failure modes, rubric, baseline, ADR, senior engineer) and lands on the proposal. There is no surprise; every claim is grounded in an artifact the buyer has already seen. The conversation is structured around the rubric thresholds, the kill clause, and the cost ceiling, not around scope ambiguity. Decisions in the room are common because discovery has already converged on specifics.

Agency commits to: producing the proposal in writing, defending it on technical grounds, and accepting the kill clause as a real exit. Buyer commits to: a yes-or-no decision within five business days from the decision-maker named Monday.

Friday produces one of three outcomes. The proposal is accepted and the engagement starts Monday on the 14-day operating cadence. The proposal is rejected and both sides walk away with the artifacts, which the buyer can use to run the same conversation with another agency. The proposal is renegotiated against a tightened scope or threshold and the agency re-presents within 48 hours. None of these outcomes is a four-week deck.

The broken alternative

Most AI agencies still sell the four-week scoping cycle. Week 1 is a kickoff workshop. Week 2 is a discovery sprint of stakeholder interviews. Week 3 is a stakeholder alignment workshop and a draft SOW. Week 4 is a discovery deck, a JIRA backlog, and a proposed engagement. There is no working code, no eval rubric, no ADR, and no failure-mode catalog from real data. The proposal is built on assertion rather than measurement, and the engagement that follows is unscopable in the same way the discovery was.

The four-week shape was already weak in 2024 and is malpractice in 2026. The tooling is mature, the patterns are codified, the failure modes are well understood. Five-day discovery is not aggressive; it is the baseline. Agencies that need four weeks to write a deck are signaling either that they do not know how to characterize a problem in code, or that they have a billing model that depends on slow starts.

The bridge from four-week scoping to five-day discovery is the eval rubric on Wednesday. Once a rubric exists, every subsequent day is forced into measurement. The proposal is bound to the rubric. The engagement that follows is bound to the rubric. The kill clause is bound to the rubric. There is a number, the number is moving, and the engagement either makes it move or does not. Decks cannot fake an eval delta. Workshops cannot fake a runnable prototype. The artifacts are either there by Friday or they are not, and five days is enough to tell.


Arthur Wandzel is the founder of SFAI Labs, a forward-deployed AI development agency in San Francisco. He has run more than 30 five-day discovery weeks across portfolio companies and clients in the last 12 months.

Frequently Asked Questions

What is an AI agency discovery week?

A 5-day discovery week is a compressed scoping method that replaces the traditional 4-week scoping cycle. Each day produces a named artifact: Monday delivers a problem narrative and stakeholder map, Tuesday a data audit and failure-mode catalog from real sample data, Wednesday a draft eval rubric and a runnable prototype, Thursday an architecture decision record and a named senior engineer, and Friday a full eval-bound proposal with kill clause. By Friday afternoon every assumption that would otherwise be buried in a four-week SOW has been written down, tested against real data, and either confirmed or killed.

Why replace 4-week AI scoping with a 5-day discovery week?

Scoping done in decks is unfalsifiable; scoping done in artifacts is the same work performed honestly. The 4-week cycle produces a kickoff workshop, a discovery sprint of stakeholder interviews, an alignment workshop, and a discovery deck with a JIRA backlog. There is no working code, no eval rubric, no ADR, and no failure-mode catalog from real data. The proposal is built on assertion rather than measurement, and the engagement that follows is unscopable in the same way the discovery was. Five-day discovery converts the same scoping work into runnable artifacts that ground the proposal in a number rather than a promise.

What does the agency commit to during discovery week?

Named outputs, on schedule, committed to a shared repository: a 600-to-900-word problem narrative on Monday, a five-dimension data audit and a failure-mode catalog from real sample data on Tuesday, a 30-to-60-example eval rubric and a runnable prototype scored against the rubric on Wednesday, a single ADR covering eight architectural choices and a named senior engineer who will own the build on Thursday, and a full eval-bound proposal with pricing and kill clause delivered to the buyer on Friday. The agency also commits to staffing discovery with the engineers who will own the build, not a separate sales team.

What does the buyer commit to during discovery week?

Three concrete commitments. First, sample data access by 9 a.m. Tuesday: a dataset large enough to surface failure modes but scoped to a single use case. Second, a domain expert who reviews the eval rubric on Thursday morning to validate the test set, scoring criteria, and thresholds. Third, the named decision-maker who can sign Friday’s proposal, present in the room and committed to a yes-or-no decision within five business days. If the buyer cannot meet any of these, discovery pauses until they can; running discovery without sample data, an eval reviewer, or a decision-maker is the canonical pattern that produces a four-week deck and a project that fails in production.

What is in the eval rubric draft on Wednesday?

Three parts. The test set is 30 to 60 ground-truth examples drawn from Tuesday’s sample data, each with an expected output or explicit acceptance criterion. The scoring criteria define what counts as pass and fail, with sub-criteria for partial credit. The thresholds set the numeric bar: typically a primary threshold tied to the success criterion in Monday’s narrative, plus secondary thresholds for safety, latency, and cost. The rubric is committed alongside a runnable harness that scores the prototype against each example. A 60-example rubric is not the full eval suite, but it is enough to ground the proposal in measurement rather than promise.

What goes in the architecture decision record on Thursday?

A single ADR covers eight choices: model selection, the routing or abstraction layer (LiteLLM, framework-native, or custom), retrieval strategy (chunking, freshness, reranking, or none), the tool-call boundary, caching strategy, fallback strategy when the primary provider is degraded, the observability stack, and the cost ceiling per request and per month. Each choice is named and justified in two sentences, grounded in the artifacts produced earlier in the week: failure modes from Tuesday, sample data shape from Tuesday, and threshold gaps from Wednesday’s baseline score. The ADR is reviewed in a 60-minute architecture session with the buyer’s engineering reviewer, and surviving choices are committed before the proposal is finalized.

What is the kill clause in the Friday proposal?

The kill clause names the eval delta the engagement must achieve by a defined checkpoint, typically week 4 of the build, and the exit terms if the delta is not met. Standard exit terms are paid-to-date settlement plus IP transfer of all artifacts produced through the checkpoint. The clause is bound to the rubric agreed on Thursday morning, so the trigger is unambiguous: if the prototype’s score against the rubric has not moved by the agreed delta by week 4, either side can exit cleanly. The kill clause is what makes the proposal eval-bound rather than a wishful list of deliverables; it converts the contract into an instrument the buyer can use to verify whether the engagement is on track.

Should the senior engineer at discovery be the same engineer who builds the system?

Yes. The discovery week’s senior engineer is the build engineer. Agencies that staff discovery with engineers different from the build team are running a sales-to-delivery handoff that produces predictable failure in week 3 of the engagement, when the context built up in discovery is lost in the transition. The Thursday senior engineer match is the answer to ‘who is the human accountable for this system in production,’ and it is presented to the buyer’s engineering reviewer in a 30-minute working session. If the agency cannot name the build engineer on Thursday or insists on a ‘transition’ to a different engineer post-signing, the engagement is selling a 2024 service in 2026 packaging.

What outcomes can the Friday proposal session produce?

Three outcomes are possible. The proposal is accepted and the engagement starts the following Monday on the 14-day operating cadence. The proposal is rejected and both sides walk away with the artifacts produced during the week, which the buyer can use to run the same conversation with another agency at a much higher fidelity than they could before discovery began. Or the proposal is renegotiated against a tightened scope or threshold, and the agency revises and re-presents within 48 hours. None of these outcomes is a four-week deck, and all three are good outcomes for the buyer because the artifacts have already converted abstract risk into observed measurement.

How does discovery week fit into the broader engagement cadence?

Discovery week is the prequel to the engagement, not a smaller version of it. If the proposal is accepted on Friday, the engagement starts the following Monday on the 14-day operating cadence: a working kickoff and eval baseline by day 2, a data audit and failure-mode catalog by day 5, an ADR and prototype by day 9, and a merged eval-gated PR by day 14. The discovery week artifacts feed directly into the engagement: the rubric becomes the engagement’s eval suite, the ADR becomes the engagement’s architecture, and the senior engineer becomes the engagement’s tech lead. By the end of week 3 (one discovery week plus two engagement weeks), there is a feature merged through an eval gate into production.

Last Updated: May 24, 2026

Arthur Wandzel
