The first 14 days of an AI agency engagement are not preamble; they are the engagement. If by day 14 your agency has not shipped a real pull request through a real eval gate into your real repo, the engagement is already failing, regardless of how well the kickoff deck went. The 2023-archetype playbook of four-week scoping followed by another four weeks of “discovery” is dead in 2026; the modern playbook ships code in week one and tests it against a written eval suite by week two. This is the day-by-day shape of an engagement that is on track, and a checklist of red flags if yours is not.
The frame is simple. Two weeks is enough to prove an agency can ship, and two weeks is short enough that the cost of being wrong is bounded. By day 14 you should have answers to four questions: can they navigate your codebase, can they reason about your problem, can they design a system that survives contact with production, and can they ship eval-gated work into your repo. An agency that can do all four in 14 days is the forward-deployed AI dev partner described in the manifesto. An agency that cannot is selling you a 2024 service in 2026 packaging.
What follows is the engagement shape I run when a portfolio company hires SFAI Labs or another forward-deployed firm. I have run this exact shape across more than two dozen kickoffs in the last 18 months. The phasing is not theoretical; it is the actual cadence that produces shipped, eval-gated software in two weeks rather than slide decks in eight.
Decision Scope
This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.
Days 1–2: kickoff, repo access, and the eval baseline
The first 48 hours decide whether you have hired an engineering team or a consulting firm cosplaying as one. There is a single rule: by the end of day 2, the agency must be writing code against your repo, and you must have a written eval baseline for the problem you hired them to solve.
Day 1 morning: the working kickoff. A 90-minute kickoff with engineers (not just account leads), the client product owner, and the client domain expert. The output is not a deck; it is a written one-pager, call it the engagement charter, that names the problem, the user, the success metric, the eval threshold, and the on-call person on each side. The charter is committed to your repo as docs/engagement-charter.md before the meeting ends. If the agency proposes a “kickoff workshop series” stretched across week one, that is the first red flag.
Day 1 afternoon: repo access, secrets, and CI. The agency requests access to the repo, the staging environment, the model API keys (held by you), and the CI pipeline. They commit a small, harmless first PR (typically a CI tweak, a README correction, or a typed config struct) to verify they can ship at all. This is the trivial PR everyone underestimates. It surfaces, in two hours, whether the agency understands your branch protections, your code-owner rules, your test runner, and your deploy gate. If you cannot give them access on day 1, the engagement is already a day behind, and that day will compound.
Day 2: the eval baseline. The named artifact is evals/baseline.md plus the actual eval suite scaffolded in code. The agency, working with your domain expert, writes 20 to 50 ground-truth examples drawn from real production cases or representative inputs. They define the pass/fail criteria, the threshold, and the cadence (per-PR or nightly). They run the baseline against the current system or a stub and record the number. The baseline number is the starting line; every PR for the next two weeks moves it.
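A minimal sketch of what the scaffolded suite might look like on day 2, assuming exact-match scoring against a stub. The names `EvalCase`, `run_baseline`, and `stub_system`, the four sample cases, and the 0.80 threshold are all illustrative assumptions, not part of any particular eval framework; real cases come from production with the domain expert.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One ground-truth example: an input and the answer the domain expert expects."""
    case_id: str
    prompt: str
    expected: str


def run_baseline(cases: list[EvalCase], system: Callable[[str], str]) -> float:
    """Return the fraction of cases passed (exact match, ignoring case and whitespace)."""
    if not cases:
        raise ValueError("eval suite is empty; write ground-truth cases first")
    passed = sum(
        1
        for case in cases
        if system(case.prompt).strip().lower() == case.expected.strip().lower()
    )
    return passed / len(cases)


def stub_system(prompt: str) -> str:
    """Hypothetical stand-in for the current system; always returns the same label."""
    return "refund"


# Illustrative cases only (assumption); a real suite would hold 20 to 50 of these.
CASES = [
    EvalCase("c1", "Customer was charged twice", "refund"),
    EvalCase("c2", "Customer wants to change plan", "upgrade"),
    EvalCase("c3", "Card declined on renewal", "retry"),
    EvalCase("c4", "Duplicate invoice issued", "refund"),
]

THRESHOLD = 0.80  # assumed target; the real value is agreed in the engagement charter

if __name__ == "__main__":
    score = run_baseline(CASES, stub_system)
    print(f"baseline={score:.2f} threshold={THRESHOLD:.2f}")
```

Exact match is the simplest possible pass/fail rule; a real suite would likely swap in a criterion suited to the task, but the shape (cases in, one number out) stays the same.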
Red flags by end of day 2. No commit to your repo. No written eval examples. The agency proposing an “alignment workshop” for week 2. A request to run the project on the agency’s repo and “transfer it later.” A statement that evals will be defined “once we understand the problem better”; they should be defined precisely because the problem is not yet understood. For more on what onboarding should produce in the first week, the AI development agency onboarding guide covers the access checklist in detail.
Days 3–5: problem-shape clarity and the data audit
By day 3 the engagement transitions from infrastructure to substance. Days 3–5 are when the agency stops talking about the problem and starts characterizing it the way an engineer characterizes a problem: by looking at the data and the failure modes rather than the user interview transcript.
Day 3: the data audit. The agency does a structured walkthrough of every data source the system will touch: the source of truth, the schema, the freshness, the volume, the access pattern, the PII boundary, the retention policy, the quality. The artifact is docs/data-audit.md and it is brutally specific: “Stripe webhooks land in BigQuery within 90 seconds, schema versioned in dbt, 12 months retention, contains email + last-4 of card.” The audit is not exhaustive; it is sharp. Five data sources characterized in this depth beat 25 sources characterized as “a database we have.”
Day 4: the failure-mode catalog. A working session, usually two hours, where the agency, the client domain expert, and one engineer brainstorm every way the system can fail in production. They write each failure mode as a row in docs/failure-modes.md with the failure, the detection mechanism, and the remediation. Categories include hallucination, retrieval drift, cost runaway, latency spike, provider outage, prompt injection, drift after silent model updates, and downstream breakage. Most of these failure modes will eventually become eval cases or monitoring alerts. The catalog is the single most predictive document of whether the system will survive its first month in production.
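One way to keep the catalog rows uniform is to hold them as typed records and render the document from them. This is a sketch under assumptions: the `FailureMode` record, the `to_markdown` helper, and the two sample rows are hypothetical, and the table layout is just one plausible format for docs/failure-modes.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One catalog row: what fails, how it is detected, and what is done about it."""
    failure: str
    detection: str
    remediation: str


def to_markdown(rows: list[FailureMode]) -> str:
    """Render the catalog as a three-column Markdown table."""
    lines = [
        "| Failure | Detection | Remediation |",
        "| --- | --- | --- |",
    ]
    lines += [f"| {r.failure} | {r.detection} | {r.remediation} |" for r in rows]
    return "\n".join(lines)


# Illustrative rows only (assumption); real rows come out of the working session.
CATALOG = [
    FailureMode(
        failure="Provider outage",
        detection="Error-rate alert on model calls",
        remediation="Route requests to the fallback provider",
    ),
    FailureMode(
        failure="Cost runaway",
        detection="Daily spend exceeds the agreed ceiling",
        remediation="Throttle traffic and page the on-call engineer",
    ),
]

if __name__ == "__main__":
    print(to_markdown(CATALOG))
```

Keeping the three fields mandatory makes it hard to record a failure mode without also naming how it will be detected and remediated.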
Day 5: the problem narrative. The agency writes a 600-to-1000-word problem narrative in docs/problem-narrative.md. It is not a PRD or a spec but a narrative that names the user, the workflow, the existing system, the gap, the constraints, and the criterion under which the project is unambiguously a success. The narrative is the document the engineer working on day 12 reads to remember why they are making the trade-off in front of them. If your agency cannot produce a tight narrative by end of week 1, they do not understand the problem yet, and any architecture they propose in week 2 will be wrong.
Red flags by end of day 5. A “discovery deck” instead of written documents. Generic failure modes copied from a template. A data audit that lists data sources but does not characterize them. A problem narrative that reads like marketing copy. The agency proposing a “follow-up discovery sprint” rather than moving to architecture. The first prototype slipping to week 3; at this point the agency is on a four-week scoping cycle, which is the broken pattern this entire engagement shape is designed to replace.
Days 6–9: architecture decisions and the first prototypes
Week 2 is the week the engagement either delivers or collapses. By day 6 the team has enough context to design; by day 9 they have a prototype that runs end-to-end against the eval suite. The named artifacts are an architecture decision record, a system diagram, and a runnable prototype on a feature branch.
Day 6: the architecture decision record. The agency writes docs/adr/0001-system-architecture.md, a single ADR covering the model selection (and why), the routing/abstraction layer (LiteLLM, custom, framework-native), the retrieval strategy (chunking, freshness, re-ranking), the tool-call boundary, the caching strategy, the fallback strategy when the primary provider fails, the observability stack, and the cost ceiling. Each choice is named and justified in two sentences. The ADR is reviewed in a 60-minute architecture session with your engineering lead. Decisions that survive the session are committed; decisions that do not are flagged for a follow-up ADR.
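The fallback strategy is the ADR item that is easiest to state and hardest to verify on paper, so a small sketch may help. It assumes providers are plain callables tried in order; `complete_with_fallback`, `AllProvidersFailed`, and the two demo providers are hypothetical names, and no real routing library's API is used or implied.

```python
from typing import Callable, Sequence

Provider = tuple[str, Callable[[str], str]]  # (name, function that completes a prompt)


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain raised an exception."""


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, completion) from the first success."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any provider error triggers fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def flaky_primary(prompt: str) -> str:
    """Hypothetical primary provider that is currently down."""
    raise TimeoutError("primary timed out")


def stable_secondary(prompt: str) -> str:
    """Hypothetical secondary provider that answers."""
    return f"answer to {prompt}"


if __name__ == "__main__":
    chain = [("primary", flaky_primary), ("secondary", stable_secondary)]
    used, text = complete_with_fallback("hi", chain)
    print(used, text)
```

Returning the provider name alongside the completion lets the observability stack record how often the fallback path is actually taken.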
Day 7: the system diagram. A diagram drawn in real time during the architecture session, not pre-fabricated. It labels the model-router boundary, where retrieval happens, where tools execute, where context is assembled, where evals run, where logs land, and where each provider degradation has a fallback. The diagram lives in the repo (Mermaid, Excalidraw, or a checked-in PNG) and is updated as the system evolves. A diagram that is not in the repo does not exist.
Day 8: the first end-to-end prototype. A feature branch with an end-to-end path from input to output, however ugly. Hardcoded values are fine. Mock data is fine. What is not fine is a prototype that does not run. The prototype is wired to the eval suite from day 2; it produces a number, the number is below the threshold, and that number is the second data point on the curve that ends with the shipped feature on day 14.
Day 9: the second eval pass. The agency improves the prototype against the eval baseline and produces the day-9 number. The number must be meaningfully better than the day-8 number. If it is not, the team writes a short note explaining why, usually because the failure they hit traces back to an architectural assumption, not a prompt detail, and the ADR needs to be revisited. This is healthy; an architecture revision in week 2 is recoverable, while one in week 6 is not.
Red flags by end of day 9. No ADR. A diagram that does not match the code. A prototype that does not run end-to-end. A prototype that runs but is not wired to the eval suite. The first eval number not improving between day 8 and day 9 with no written explanation. A request to extend the engagement before the first PR has shipped; agencies that ask for more time before they have shipped anything are signaling that they cannot ship at all.
Days 10–14: the first eval-gated PR and the demo cadence
The final five days are the proof. Everything in days 1–9 was scaffolding for one thing: a real PR, gated by the real eval suite, merged into the real repo, demoed on real data. The named artifact is the merged PR; the supporting artifacts are the demo recording and the second-week retro.
Day 10: PR opened, eval gate live. The agency opens a PR against the main branch with the first feature increment. The CI runs the eval suite. The PR description includes the eval delta: baseline 0.61, this PR 0.74, threshold 0.80, gap 0.06. The PR is not merged yet; it is in review. This is the moment the eval discipline you built on day 2 pays its first dividend: the conversation in the PR is structured around a number, not opinions.
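The arithmetic behind that PR description is small enough to sketch. The example below assumes the gate is a simple comparison of the current score against the threshold; the function names `eval_delta`, `pr_summary`, and `gate_exit_code` are hypothetical, and a real CI job would read the scores from the eval run rather than from literals.

```python
def eval_delta(baseline: float, current: float, threshold: float) -> dict[str, float]:
    """Compute the improvement over baseline and the remaining gap to the threshold."""
    return {
        "delta": round(current - baseline, 4),
        "gap": round(max(threshold - current, 0.0), 4),
    }


def pr_summary(baseline: float, current: float, threshold: float) -> str:
    """Format the one-line eval summary that goes into the PR description."""
    gap = eval_delta(baseline, current, threshold)["gap"]
    return (
        f"baseline {baseline:.2f}, this PR {current:.2f}, "
        f"threshold {threshold:.2f}, gap {gap:.2f}"
    )


def gate_exit_code(current: float, threshold: float) -> int:
    """Return 0 when the score meets the threshold, 1 otherwise (a failing CI step)."""
    return 0 if current >= threshold else 1


if __name__ == "__main__":
    # Numbers taken from the day-10 example in the text.
    print(pr_summary(0.61, 0.74, 0.80))
    print("gate exit code:", gate_exit_code(0.74, 0.80))
```

With the day-10 numbers the improvement over baseline is 0.13 and the gap is 0.06, so the gate step would fail and the PR stays in review, which matches the state described above.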
Day 11: review and revision. Your engineering lead reviews the PR with the agency engineer. The review is technical, not editorial: failure modes covered, eval cases added for the cases the prototype gets wrong, observability hooks in place, cost-per-request measured. The agency revises and pushes; the eval delta improves. If the gap to threshold closes, the PR is merged. If it does not, day 12 is another revision day, which is fine; the engagement is still on track because the artifact exists and the discipline is real.
Day 12: first PR merged, in staging. The PR is merged to main and deploys to staging with the eval gate active in CI for all subsequent PRs. The system is now in a state where every future change is measured the same way. This is the inflection point of the engagement; from here on, the team is shipping under the discipline they built rather than building the discipline.
Day 13: the demo. A 30-minute demo to the broader stakeholder group. It is not a slide-driven demo but a working demo against real data, with the eval dashboard pulled up next to the product. The agency walks through what was shipped, what was not, what the eval gaps are, and what the next two-week increment will produce. The demo is recorded and either committed to the repo as docs/demos/2026-week-2.mp4 or linked from it. Demos are observable artifacts of progress; an engagement without a demo by day 13 has nothing to show stakeholders, which means the next budget conversation will be a hard one.
Day 14: the second-week retro and the standing cadence. A 60-minute retro with the agency, the client product owner, and the engineering lead. What worked, what did not, what the next two-week increment looks like, what the eval threshold is for it. The output is docs/retro-2026-week-2.md and the standing cadence: weekly demo, biweekly retro, daily PR review, eval gate on every merge. This is the shape of a healthy engagement for the next six months, and it is a shape that is impossible to fake.
Red flags by end of day 14. No merged PR. A merged PR that bypassed the eval gate. A demo that runs on synthetic data because the team never connected the system to real data. A retro held in person with no written artifact. The agency proposing a “transition phase” before delivery has begun. For a fuller treatment of the cadence beyond day 14, the first 30 days kickoff guide covers the rhythm into month 2.
The broken alternative
Most AI engagements in 2024 and early 2025 ran a different shape, and many agencies still sell it. Week 1 is a kickoff workshop. Week 2 is a discovery sprint. Week 3 is a stakeholder alignment workshop. Week 4 is a written discovery deck and a proposed scope of work. Code, if any, ships in week 5 or 6. Evals, if any, are defined in week 8. The first production deploy is in week 12.
That shape was already wrong in 2024 and is malpractice in 2026. The tooling is mature, the patterns are codified, and the failure modes are known. An agency that needs eight weeks to start writing code is signaling either that they do not know how to start, or that they have a billing model that depends on slow starts. Both are reasons to terminate the engagement and recover the budget. The 14-day shape described above is not aggressive; it is the baseline. Agencies that beat it exist; agencies that miss it are not the future.
The bridge from one shape to the other is the eval baseline on day 2. Once an eval suite exists and is committed to the repo, every subsequent week is forced into the discipline. There is a number, the number is moving, and the engagement either makes the number move or does not. Decks cannot fake an eval delta. Workshops cannot fake a merged PR. The artifacts are either there or they are not, and 14 days is enough to tell.
Arthur Wandzel is the founder of SFAI Labs, a forward-deployed AI development agency in San Francisco. He has run more than two dozen 14-day engagement kickoffs across portfolio companies and clients in the last 18 months.
Frequently Asked Questions
What should an AI agency deliver in the first 14 days of an engagement?
Six concrete artifacts committed to your repo: an engagement charter (day 1), an eval baseline with 20 to 50 ground-truth examples (day 2), a data audit and failure-mode catalog (days 3-4), a problem narrative (day 5), an architecture decision record and system diagram (days 6-7), and a merged eval-gated pull request with a recorded demo (days 10-13). If those artifacts do not exist by day 14, the engagement is failing regardless of how many meetings have been held.
Is 14 days realistic for an AI agency to ship a real PR?
Yes, in 2026 this is the baseline rather than the stretch goal. The tooling is mature, the architectural patterns are codified, and the failure modes are well understood. A forward-deployed AI agency should ship a trivial PR on day 1 to verify access, scaffold an eval suite on day 2, prototype end-to-end by day 8, and merge the first feature PR through an eval gate by day 12. Agencies that need eight weeks to start writing code are running a 2024 service model and should be either renegotiated or replaced.
What is the eval baseline and why is it set on day 2?
The eval baseline is a written set of 20 to 50 ground-truth examples drawn from real production cases, with explicit pass/fail criteria, a numeric threshold tied to a business outcome, and a CI integration that runs the suite per-PR. It is set on day 2 because every subsequent week needs a number to move. Without an eval baseline, code review devolves into opinion-trading and the engagement cannot be measured. With one, every PR has an eval delta in its description and the conversation is structured around evidence.
What does a healthy day 1 of an AI agency engagement look like?
A 90-minute working kickoff with engineers and the client domain expert (not just account leads), the production of a written engagement charter committed to the repo, the granting of repo and staging and CI access, and a small harmless first PR shipped to verify the agency can navigate your branch protections, code-owner rules, and deploy gate. If the agency proposes a multi-week kickoff workshop series instead of working code on day 1, the engagement is already a week behind.
What red flags signal that an AI engagement is going off the rails?
By end of day 2: no commits to your repo, no written eval examples, evals defined as ‘manual testing.’ By end of week 1: a discovery deck instead of written documents, generic failure modes copied from a template, the first prototype slipping to week 3. By end of week 2: no architecture decision record, a system diagram that does not match the code, a prototype that does not run end-to-end, no merged PR, a demo on synthetic data because the team never connected to real data, or a request to extend the engagement before any feature has shipped.
How is the 14-day shape different from a traditional consulting engagement?
The traditional consulting shape runs week 1 as kickoff workshop, week 2 as discovery sprint, week 3 as stakeholder alignment, week 4 as a discovery deck. Code ships in week 5 or 6, evals are defined in week 8, and production deploys in week 12. The 14-day forward-deployed shape compresses all of that: the engagement charter is written on day 1, the eval baseline on day 2, the architecture decisions on day 6, and the first eval-gated PR is merged on day 12. The difference is not pace; it is the substitution of artifacts for meetings as the unit of progress.
Who should be on the AI agency side during the first 14 days?
Engineers writing code from day 1, not account executives running workshops. A typical staffing pattern is a tech lead at 50 percent allocation, one or two senior engineers at full allocation, and a part-time product partner who attends the kickoff, the architecture session, and the day-13 demo. The client side mirrors this with an engineering lead, a product owner, and a domain expert who is responsible for ground-truth eval cases. If the agency proposes a project manager as their primary contact for week 1, you have hired the wrong shape of firm.
What is the ‘engagement charter’ and why is it committed to the repo on day 1?
The engagement charter is a one-page document at docs/engagement-charter.md that names the problem, the user, the success metric, the eval threshold, and the on-call person on each side. It is committed to the repo on day 1 because committing it forces the kickoff conversation to converge on specifics rather than dissolving into intent statements. A charter written in a Google Doc and never linked to the codebase tends to be forgotten by week 3; a charter that lives next to the code is referenced every time a trade-off has to be made.
Should the AI agency or the client hold the model API keys during the first 14 days?
The client should hold the keys from day 1. The agency uses the client’s Anthropic, OpenAI, and Google accounts, the bill goes to the client, and the agency’s job is to keep the bill predictable and small. Agencies that want to hold the keys ‘for convenience’ during development are setting up token arbitrage as the eventual commercial structure. The first PR on day 1 is a useful test of this: the keys should already be in the client’s CI secret store before the PR runs, not staged in the agency’s environment with a promised handoff later.
What happens after day 14 in a healthy AI engagement?
The standing cadence kicks in: a weekly demo against real data with the eval dashboard visible, a biweekly retro that produces a written artifact in the repo, daily PR reviews with an eval delta in every PR description, and an eval gate active in CI on every merge to main. The second 14-day increment ships a feature with a tightened eval threshold; the third increment hardens the system against the failure modes catalogued in week 1. By day 60 the system is in production, and the engagement has shifted from build to operate without a discontinuity.