A great AI agency kickoff is a two-day working session that produces nine artifacts and zero PowerPoint slides. It is not a relationship-building exercise; it is the first 16 working hours of an engagement that will ship eval-gated code by week two. By the end of day two, the engagement has a stakeholder map, a problem narrative, a data audit, an eval rubric draft, a scaffolded eval suite, a proposed architecture decision record, signed-off success criteria, a demo cadence calendar, and the wording of the kill clause everyone has agreed to. This is the hour-by-hour shape that produces those artifacts, and the staffing assumptions that make it possible.
Most AI agency kickoffs in 2026 are still run as week-long workshop series, half-discovery and half-relationship-management. That shape is wrong. A two-day kickoff with the right people in the room produces sharper artifacts in less time, and it sets the engagement cadence that the next 14 days will require. The hour-by-hour agenda below is the exact one I run when SFAI Labs starts a new engagement, refined across roughly two dozen kickoffs in the last 18 months.
Decision Scope
This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.
Table of contents
- Why two days, not five
- Who attends, and who does not
- Day 1 morning: stakeholder cartography (90m)
- Day 1 late-morning: problem narrative refinement (90m)
- Day 1 afternoon: data audit (2h)
- Day 1 close: eval rubric draft (90m)
- Day 2 morning: ADR proposed (2h)
- Day 2 late-morning: success-criteria sign-off (90m)
- Day 2 afternoon: demo cadence calendar (60m)
- Day 2 close: kill clause wording (60m)
- What success looks like at end of day 2
- FAQ
Why two days, not five
A kickoff longer than two days is not deeper; it is more diluted. The artifacts a five-day kickoff produces are the same artifacts a two-day kickoff produces; the additional three days are absorbed by stakeholder management, repeated context-setting, and the production of a discovery deck nobody references after week three. Two days is the right shape because two days is enough time to produce sharp artifacts and short enough that the engagement does not lose momentum before code starts shipping.
The two-day shape also fits the calendars of the people who need to be in the room. A senior engineering lead can give a project two days; they cannot give it five. A domain expert can give a project two days; they cannot give it five. If the kickoff requires more than two days, what is missing is preparation, not time. The preparation that makes a two-day kickoff possible (pre-reads sent five days in advance, repo access provisioned, sample data identified) is itself a forcing function on the engagement.
For the broader cadence the kickoff feeds into, see the first 14 days of an AI agency engagement.
Who attends, and who does not
The room contains seven people. From the agency: the tech lead, one senior engineer, and the engagement partner (not the salesperson; the engagement partner who will sign off on each quarter). From the client: the engineering lead who owns the codebase, the product owner who owns the problem, the domain expert who can produce ground-truth eval cases, and one executive sponsor who attends only for the success-criteria sign-off and the kill-clause session.
The room does not contain: the agency’s account executive, the agency’s project manager (unless they are also the engagement partner), procurement, legal except by phone for the kill-clause session, marketing, or any stakeholder who does not produce a deliverable in the next 14 days. Every additional person in the room reduces signal-to-noise; every absent decision-maker produces a follow-up meeting.
The pre-read sent five business days before the kickoff names the seven attendees explicitly. If by 48 hours before the kickoff three of the seven are not confirmed, the kickoff is rescheduled. A kickoff missing the engineering lead or the domain expert produces artifacts that have to be redone in week two.
Day 1 morning: stakeholder cartography (90m)
0900–1030. Stakeholder cartography. A 90-minute working session that produces docs/stakeholders.md, a written map of who must approve what, who must be informed, and who is on the on-call rotation. Each stakeholder is named, not described by role; “VP of Engineering” without a name is not a stakeholder, it is a job title. The artifact lists for each named stakeholder: the approval class (architecture, eval threshold, deploy, budget), the response-time SLA they have committed to, and the named alternate who can approve in their absence.
The cartography also names two structural roles that are commonly missing from AI engagements: the eval owner (the person who maintains the eval suite and signs off when thresholds change) and the on-call escalation (the person who gets paged when the system fails in production). Engagements without a named eval owner and a named on-call usually invent these roles in week six under stress, which is the worst possible time to invent them.
The output is committed to the repo before the next session starts. The discipline of committing artifacts in real time is itself a kickoff signal: artifacts that exist only in shared docs tend to evaporate by week three.
Day 1 late-morning: problem narrative refinement (90m)
1100–1230. Problem narrative refinement. The agency arrives with a draft docs/problem-narrative.md written from the pre-read material. The 90-minute session is a structured walkthrough with the domain expert and the product owner. Each paragraph is read aloud; each claim is challenged; every place where the draft says “the user” the room insists on the actual user persona, and every place the draft says “improve” the room insists on the named metric.
The output is a 600-to-1000-word narrative committed to the repo. Not a PRD, not a slide deck; a narrative an engineer working on day 12 can read in four minutes to remember why they are making the trade-off in front of them. If the room cannot converge on a single narrative in 90 minutes, the engagement does not yet have a problem definition tight enough to scope, and the kickoff agenda for day two should be re-shaped to spend more time on narrative and less on architecture.
The narrative names the user, the workflow, the existing system, the gap, the constraints (latency, cost, regulatory), and the criterion under which the project is unambiguously a success. That last clause is the one most narratives miss and the one that most matters.
Day 1 afternoon: data audit (2h)
1330–1530. Data audit. A two-hour structured walkthrough of every data source the system will touch. Each source is characterized along seven axes: source of truth, schema (versioned where), freshness (latency from event to availability), volume (rows per day, growth rate), access pattern (who queries it, how often), PII boundary (what fields, what retention), and data quality (known issues, known repair patterns). The artifact is docs/data-audit.md.
The audit is brutally specific. “Stripe webhooks land in BigQuery within 90 seconds, schema versioned in dbt, 12-month retention, contains email + last-4 of card” beats “Stripe data, daily, sensitive.” Five sources characterized in this depth are worth more than 25 sources characterized as “a database we have.”
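One way to keep the audit this specific is to give each source a structured record with one field per axis and flag any axis still left vague. The sketch below is illustrative only: the DataSourceAudit class, the missing_axes helper, and the "TBD"/"unknown" convention are assumptions made for this example, not part of the kickoff artifacts themselves. The Stripe values come from the example above; axes that example does not mention are left as TBD.

```python
from dataclasses import dataclass, fields


@dataclass
class DataSourceAudit:
    """Hypothetical record for one source in docs/data-audit.md (seven axes plus a name)."""
    name: str
    source_of_truth: str
    schema: str
    freshness: str
    volume: str
    access_pattern: str
    pii_boundary: str
    data_quality: str


def missing_axes(entry: DataSourceAudit) -> list[str]:
    """Return the names of fields that are blank or still marked as unknown."""
    vague = {"", "unknown", "tbd"}
    return [
        f.name
        for f in fields(entry)
        if getattr(entry, f.name).strip().lower() in vague
    ]


# Values taken from the Stripe example in the text; everything else is unfilled.
stripe = DataSourceAudit(
    name="Stripe webhooks",
    source_of_truth="TBD",
    schema="versioned in dbt",
    freshness="lands in BigQuery within 90 seconds",
    volume="TBD",
    access_pattern="TBD",
    pii_boundary="email + last-4 of card, 12-month retention",
    data_quality="unknown",
)

print(missing_axes(stripe))
```

Run against a half-finished entry, the helper lists the axes the room still has to pin down before the session closes.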
The data audit is also when the eval ground-truth source is identified. The domain expert points at a specific sample of real production data and says “these 50 rows are the cases I want the system to get right.” That sample is the seed of the eval rubric drafted in the next session, and committing it during this session means the next session has concrete material to work with.
Day 1 close: eval rubric draft (90m)
1600–1730. Eval rubric draft. The 90-minute session that distinguishes a forward-deployed AI agency kickoff from a 2024-style discovery workshop. The room takes the 50 real production cases identified in the data audit and writes pass/fail criteria against them. For each case, the rubric specifies: the input, the expected output (or expected behavior), the pass criterion (exact match, semantic match, threshold-based), and the failure category if the case fails (hallucination, miss, wrong format, latency, cost).
The output is evals/baseline-rubric.md plus a scaffolded eval suite in code committed to the repo. The agency’s senior engineer is writing code during this session: not after the kickoff, during it. By 1730 on day 1, there is a runnable (or stub-runnable) eval suite in the repo that produces a number, even if the number is meaningless on a stub. That number is the starting line for the next 14 days.
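A stub-runnable suite can be very small. The following is a minimal sketch, assuming exact-match scoring only and a placeholder system under test; EvalCase, run_suite, stub_system, and the placeholder cases are hypothetical names invented for illustration, and a real suite would add the semantic and threshold-based criteria the rubric calls for.

```python
from dataclasses import dataclass
from typing import Callable

# Failure categories named in the rubric.
FAILURE_CATEGORIES = {"hallucination", "miss", "wrong format", "latency", "cost"}


@dataclass
class EvalCase:
    case_id: str
    input: str
    expected: str
    pass_criterion: str    # only "exact" is implemented in this sketch
    failure_category: str  # category recorded if the case fails


def run_suite(cases: list[EvalCase], system: Callable[[str], str]) -> float:
    """Run every case through `system` and return the pass rate in [0, 1]."""
    if not cases:
        return 0.0
    passed = 0
    for case in cases:
        if case.failure_category not in FAILURE_CATEGORIES:
            raise ValueError(f"{case.case_id}: unknown failure category")
        if case.pass_criterion != "exact":
            raise NotImplementedError(f"{case.case_id}: {case.pass_criterion}")
        if system(case.input).strip() == case.expected.strip():
            passed += 1
    return passed / len(cases)


def stub_system(prompt: str) -> str:
    """Placeholder for the real system; always returns an empty answer."""
    return ""


# Placeholder cases; the real ones come from the 50 production rows.
cases = [
    EvalCase("case-001", "a", "A", "exact", "miss"),
    EvalCase("case-002", "b", "x", "exact", "wrong format"),
]

print(f"baseline pass rate: {run_suite(cases, stub_system):.2f}")
```

On the stub the pass rate is zero, which is the point: the number exists, and every later change moves it.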
The rubric draft is not the final eval suite. It is the seed. The full eval suite expands across days 2–7, but the rubric draft set during the kickoff is what every subsequent expansion measures itself against. For why eval-anchored engagements outperform feature-anchored ones, see stop scoping AI projects in features, scope them in evaluations.
Day 2 morning: ADR proposed (2h)
0900–1100. Architecture decision record (proposed). The agency arrives at day 2 with a draft docs/adr/0001-system-architecture.md written overnight from the day-1 artifacts. The two-hour session is a structured review with the engineering lead. Each decision is named and challenged: model selection (which provider, which model, which abstraction layer), retrieval strategy (chunking, freshness, re-ranking, vector store), tool-call boundary (what the model is allowed to call, what authorization is required), caching strategy, fallback strategy when the primary provider degrades, observability stack, and cost ceiling.
Decisions that survive the session are committed as the proposed ADR. Decisions that do not survive are flagged for a follow-up ADR scheduled for day 4. The ADR is “proposed” because it has not yet survived the first prototype contact with the eval suite; the day-7 ADR review will confirm or revise it. Proposed-now-confirmed-later beats decided-now-discovered-wrong-in-week-six.
The session also names the observability stack that will be installed before the first PR ships. For the named components and tool options, see the AI agency observability stack we install on day one.
Day 2 late-morning: success-criteria sign-off (90m)
1130–1300. Success-criteria sign-off. This is the first of the two sessions the executive sponsor attends. The 90-minute session converts the problem narrative and the eval rubric draft into a written success criterion that the executive sponsor signs off on in the repo, by name, in a commit to docs/success-criteria.md.
The success criterion is one sentence and three numbers. The sentence names the user behavior change the system must produce. The three numbers are: the eval threshold (for example, 0.78 on the named suite), the latency P95 (for example, under 1.4s), and the cost-per-request ceiling (for example, under $0.07). If the executive sponsor cannot sign the criterion in a 90-minute session, the engagement does not have executive alignment yet, and the next two weeks will produce work the sponsor will reject.
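Because the criterion reduces to three numbers, it can be checked mechanically at each demo. The sketch below uses the illustrative values from the text (0.78, 1.4s, $0.07); the SuccessCriteria class and the breaches function are hypothetical names, and the comparison directions (eval at or above the threshold passes, latency and cost must be strictly under their ceilings) are an assumed reading of the wording above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessCriteria:
    """Illustrative defaults taken from the examples in the text."""
    eval_threshold: float = 0.78        # minimum score on the named suite
    latency_p95_s: float = 1.4          # P95 latency must be under this
    cost_per_request_usd: float = 0.07  # cost per request must be under this


def breaches(
    criteria: SuccessCriteria,
    eval_score: float,
    latency_p95_s: float,
    cost_per_request_usd: float,
) -> list[str]:
    """Return which of the three numbers are currently in breach."""
    out = []
    if eval_score < criteria.eval_threshold:
        out.append("eval")
    if latency_p95_s >= criteria.latency_p95_s:
        out.append("latency")
    if cost_per_request_usd >= criteria.cost_per_request_usd:
        out.append("cost")
    return out


criteria = SuccessCriteria()
print(breaches(criteria, eval_score=0.81, latency_p95_s=1.2, cost_per_request_usd=0.05))
print(breaches(criteria, eval_score=0.70, latency_p95_s=1.6, cost_per_request_usd=0.05))
```

An empty list means the week's measurements hold against the signed criterion; anything else names the number to discuss at the demo.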
The signed criterion is the ground truth the engagement runs against for the next 90 days. It is not a goal; it is the contract. Q2 begins only if Q1 closes successfully against this criterion.
Day 2 afternoon: demo cadence calendar (60m)
1400–1500. Demo cadence calendar. A one-hour session that produces docs/cadence.md and the actual calendar invites for the next 90 days. The cadence is non-negotiable: a weekly 30-minute demo against real data with the eval dashboard pulled up, a biweekly retro that produces a written artifact in the repo, daily PR reviews with eval-delta in every PR description, and an end-of-quarter review with the executive sponsor.
The calendar invites are sent during the session, not after. Cadence that lives in a doc and not on calendars dissolves by week four. The session also produces the demo template, a written one-pager that says how each weekly demo will be structured (what shipped, what did not, what the eval delta is, what next week ships), so the demos do not drift into freeform stakeholder updates.
The cadence calendar is also the calendar against which the kill clause measures. If the system has not held the SLA for two consecutive weeks at the weekly demo, the kill clause’s escalation path triggers.
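The consecutive-weeks rule is simple to state in code. This is a minimal sketch, assuming each weekly demo records a single breached/held flag; the function names are hypothetical, and the two-week default mirrors the example above rather than any fixed contractual value.

```python
def trailing_breach_streak(weekly_breached: list[bool]) -> int:
    """Count consecutive breached weeks, ending at the most recent demo."""
    streak = 0
    for breached in reversed(weekly_breached):
        if not breached:
            break
        streak += 1
    return streak


def escalation_triggered(weekly_breached: list[bool], breach_window_weeks: int = 2) -> bool:
    """True once the most recent `breach_window_weeks` demos were all in breach."""
    return trailing_breach_streak(weekly_breached) >= breach_window_weeks


# One flag per weekly demo, oldest first: True means the SLA was not held.
history = [False, True, False, True, True]
print(trailing_breach_streak(history))
print(escalation_triggered(history))
```

Counting only the trailing streak means a single recovered week resets the clock, which matches the word "consecutive" in the clause.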
Day 2 close: kill clause wording (60m)
1500–1600. Kill clause wording. The session most kickoffs do not run, and the one that most determines whether the engagement survives stress. Legal joins by phone for the first 20 minutes; the executive sponsor returns for the full hour. The output is the kill clause that both sides have signed off on in writing, committed to the repo as docs/kill-clause.md and reflected in the contract.
The clause names: the eval threshold (the same one in the success criteria), the latency SLA, the cost ceiling, the breach windows (how many consecutive weeks of breach trigger termination), the notice period (typically 5 business days), the handoff package the agency commits to deliver on termination (runbooks, prompt registry export, eval suite documentation, on-call rotation transferred), and the financial settlement (typically pro-rated to the day of termination, no further fees).
The clause is not punitive; it is symmetrical. The same triggers that allow the client to terminate the agency early also allow the agency to flag a structural breach by the client (denial of repo access, refusal to staff a domain expert, refusal to sign architectural decisions). For the broader contractual shape, see the AI agency annual contract is dead.
What success looks like at end of day 2
By 1600 on day 2, the repo contains nine artifacts that did not exist before: stakeholder map, problem narrative, data audit, eval rubric draft, scaffolded eval suite, proposed ADR, signed success criteria, demo cadence calendar, kill clause wording. None of them are decks. All of them are committed.
The team also has, by end of day 2, a small first PR merged (the trivial CI tweak or README correction described in the first-14-days anatomy), so the engagement has shipped working code before the kickoff ends. That trivial PR is the kickoff's stretch goal: by the end of the kickoff, the agency has navigated branch protections, code-owner rules, and the deploy gate, and the team trusts they can ship.
What does not exist at end of day 2: a kickoff deck, a discovery report, a stakeholder communication plan, a “next steps” slide. The deliverables the engagement needs are the artifacts in the repo. Anything else produced during the kickoff is overhead.
FAQ
The two-day kickoff is the highest-leverage 16 hours of an AI agency engagement. Done well, it produces nine artifacts that anchor the next 90 days; done poorly, it produces a stakeholder communication plan and a sense of momentum that evaporates in week three. The hour-by-hour agenda above is not a template; it is the actual shape that has worked across roughly two dozen engagements. Run it the first time with discipline, and it becomes the muscle memory of every subsequent engagement.
Arthur Wandzel is the founder of SFAI Labs, a forward-deployed AI development agency in San Francisco. He has run the two-day kickoff agenda described here roughly two dozen times across portfolio companies and clients in the last 18 months.
Frequently Asked Questions
How long should an AI agency kickoff take?
Two days, not five. A kickoff longer than two days is not deeper; it is more diluted. The artifacts a five-day kickoff produces are the same artifacts a two-day kickoff produces; the additional three days are absorbed by stakeholder management and the production of a discovery deck nobody references after week three. Two days is also the right shape for the calendars of the people who need to be in the room: a senior engineering lead and a domain expert can give a project two days, but they cannot reliably give it five.
Who should attend an AI agency kickoff?
Seven people. From the agency: the tech lead, one senior engineer, and the engagement partner. From the client: the engineering lead who owns the codebase, the product owner who owns the problem, the domain expert who can produce ground-truth eval cases, and one executive sponsor who attends only for success-criteria sign-off and the kill-clause session. Account executives, project managers who are not the engagement partner, procurement, and stakeholders who do not produce a deliverable in the first 14 days should not be in the room.
What artifacts should a kickoff produce?
Nine committed artifacts in the repo by end of day 2: stakeholder map, problem narrative, data audit, eval rubric draft, scaffolded eval suite, proposed architecture decision record, signed success criteria, demo cadence calendar, and kill clause wording. Plus a small first PR merged to verify the agency can navigate branch protections, code-owner rules, and the deploy gate. None of these artifacts are decks. All of them are committed in real time during the kickoff, not produced as deliverables afterward.
Why is the eval rubric drafted on day 1, not later?
Because the eval rubric is what every subsequent week measures itself against. Drafting the rubric on day 1, using the 50 real production cases identified during the data audit earlier the same day, produces a starting line for the next 14 days. The rubric specifies for each case the input, expected output, pass criterion, and failure category. By 1730 on day 1 there is a runnable or stub-runnable eval suite committed to the repo. The rubric is not the final eval suite; it is the seed that expands across days 2 through 7 and remains the anchor through Q1.
What does a great success-criteria sign-off look like?
One sentence and three numbers, signed by the executive sponsor by name in a commit to docs/success-criteria.md. The sentence names the user behavior change the system must produce. The three numbers are the eval threshold (for example 0.78 on the named suite), the latency P95 (for example under 1.4s), and the cost-per-request ceiling (for example under $0.07). If the executive sponsor cannot sign the criterion in a 90-minute session, the engagement does not have executive alignment yet, and the next two weeks will produce work the sponsor will reject.
What does the kill clause session produce?
A written kill clause committed to the repo as docs/kill-clause.md and reflected in the contract. The clause names the eval threshold, the latency SLA, the cost ceiling, the breach windows (how many consecutive weeks trigger termination), the notice period (typically 5 business days), the handoff package the agency commits to deliver on termination, and the financial settlement (typically pro-rated to the day of termination). The clause is symmetrical; the same triggers that allow the client to terminate also allow the agency to flag a structural breach by the client.
Should the agency arrive at the kickoff with drafts, or build them in the room?
Drafts arrive at the kickoff; refinement happens in the room. The agency arrives at day 1 with a draft problem narrative written from pre-read material. The agency arrives at day 2 with a draft architecture decision record written overnight from day-1 artifacts. Building from scratch in the room wastes the time of the senior engineering lead and the domain expert; refining sharp drafts in the room produces better artifacts than either side could write alone. If the agency arrives with no drafts, the kickoff is structured wrong from the start.
Why does the demo cadence calendar matter as a kickoff deliverable?
Cadence that lives in a doc and not on calendars dissolves by week four. The demo cadence session produces the actual calendar invites for the next 90 days (sent during the session, not after) for weekly 30-minute demos against real data with the eval dashboard, biweekly retros that produce a written artifact, daily PR reviews with eval-delta in every PR description, and an end-of-quarter executive review. The session also produces the demo template, a one-pager structuring each weekly demo so they do not drift into freeform stakeholder updates.
What does the data audit deliver during the kickoff?
A document at docs/data-audit.md characterizing every data source the system will touch along seven axes: source of truth, schema (versioned where), freshness, volume, access pattern, PII boundary, and known data quality issues. Five sources characterized in this depth beat 25 sources characterized as ‘a database we have.’ The audit is also when the eval ground-truth source is identified; the domain expert points at a specific sample of real production data and says ‘these 50 rows are the cases I want the system to get right,’ and that sample seeds the eval rubric draft later the same day.
What happens if a key attendee cannot make the kickoff?
Reschedule. A kickoff missing the engineering lead or the domain expert produces artifacts that have to be redone in week two, which is more expensive than rescheduling. The pre-read sent five business days before the kickoff names the seven attendees explicitly, and if by 48 hours before the kickoff three of the seven are not confirmed, the engagement reschedules. This sounds rigid and is in fact the correct discipline; engagements that compromise on attendance for the kickoff routinely compromise on attendance for the weekly demo, and the cadence collapses by week six.