Inside the AI agency standup: how 30 minutes a day prevents 30 days of rework

The standard scrum standup is the wrong shape for an AI engineering team. It is a status-first ritual, organized around what each engineer did yesterday and what they will do today, and it was designed in 2001 for software where the artifact under construction was deterministic code. AI systems are not deterministic code. The unit of progress is not “the feature is built”; it is “the eval is moving.” A standup that does not look at the eval delta first is a standup that is structurally blind to whether the team is making progress at all.

What follows is the 30-minute standup format I run at SFAI Labs and recommend to portfolio companies. It is longer than a scrum standup on purpose (two to three times longer, in fact) because the things you must check before writing code today cannot be checked in ten minutes of round-robin status updates. The thirty minutes is the cheapest insurance you can buy against the 30 days of rework that accumulate when an AI team ships against a regression nobody noticed for a week. The format is also structurally different from scrum: eval-first, agent-aware, cost-aware. Each of those words is doing real work, and the rest of this piece is a defense of why. The standup is one of the daily rituals of the forward-deployed AI dev partner described in the manifesto, and it is the ritual that most distinguishes a 2026 AI engineering team from a 2024 software consultancy.

Decision Scope

This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.

The five-block format

A healthy AI agency standup has five blocks of fixed length. The blocks are ordered by what would most likely be missed if the standup ran out of time: eval delta first, because if the eval is regressing nothing else matters; the open PR walkthrough second, because that is where today’s work will land; the cost spike check third, because cost regressions compound silently while features ship; the agent and model regression triage fourth, because the team needs to know if anything upstream of their code changed under them; and the daily commitment last, because what each engineer does today is downstream of all four of the above.

The total is exactly 30 minutes. The blocks are timed. A facilitator (usually the tech lead, sometimes rotating) keeps each block to its slot and pushes anything that runs long into a follow-up working session immediately after. The standup is a triage room, not a working room. Decisions made in standup are typed-decision-only: “this is now a P0,” “we are reverting that PR,” “we are rolling back to model X.” Implementation happens after.
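The timed agenda can be sketched as a small schedule builder. This is a minimal illustration, not tooling the format depends on: the block names and lengths come from the description above, while the `schedule` helper and the 09:30 start time are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Block names and lengths (minutes), in the order described above.
BLOCKS = [
    ("eval delta review", 5),
    ("open PR walkthrough", 10),
    ("cost spike check", 5),
    ("agent and model regression triage", 5),
    ("today's commitment", 5),
]


def schedule(start: datetime) -> list[tuple[str, datetime, datetime]]:
    """Lay the blocks end to end from `start`; return (name, begin, end) slots."""
    slots = []
    cursor = start
    for name, minutes in BLOCKS:
        end = cursor + timedelta(minutes=minutes)
        slots.append((name, cursor, end))
        cursor = end
    return slots


# Hypothetical 09:30 start; any start time works.
slots = schedule(datetime(2026, 1, 5, 9, 30))
total_minutes = sum(minutes for _, minutes in BLOCKS)

for name, begin, end in slots:
    print(f"{begin:%H:%M}-{end:%H:%M}  {name}")
```

A facilitator could keep the printed slots next to the dashboard so that overruns are visible as they happen.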

Block 1: eval delta review (5 minutes)

The standup opens with the eval dashboard projected on the screen. Not a status update, not a Jira board, not a sprint chart: the eval dashboard. The team looks at the headline number for each eval suite, compared against yesterday and against the baseline. If a suite is regressing, the team identifies which PR is the most likely culprit and who owns the regression by the end of the block. The owner is not assigned to fix it during standup; they are assigned to investigate and report back by end of day.

This is the block that makes the entire ritual eval-first. In a scrum standup, the team can ship for a week with a steadily declining eval before anybody notices, because the standup never asks “is the system getting better?”; it asks “are people busy?” The eval-delta block makes the first question the only question that matters in the first five minutes. The frame, and the broader observability discipline that supports it, is the same one we describe in the AI agency operating system breakdown; evals are not a QA checkpoint, they are the heartbeat the standup listens to.

A regression caught in this block costs the team one day of rework: the PR is reverted, the cause is investigated, the eval case is sharpened, the fix lands the next day. A regression caught a week later costs three weeks: the failure has compounded across five PRs, the evals were never sharpened, the customer-visible behavior has drifted, and the team has to bisect through a sprint of changes to find the root cause. The five-minute block is worth, conservatively, fifteen days of avoided rework per quarter.
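The comparison the team makes in this block (each suite's headline number against yesterday and against the baseline) is easy to sketch in code. This is a minimal illustration under stated assumptions: the `SuiteScore` record, the suite names, the scores, and the 0.01 tolerance are all hypothetical, and a real dashboard would pull these numbers from wherever the team stores eval runs.

```python
from dataclasses import dataclass


@dataclass
class SuiteScore:
    name: str
    baseline: float
    yesterday: float
    today: float


def regressing(suites, tolerance=0.01):
    """Return (name, delta_vs_yesterday, delta_vs_baseline) for suites that
    dropped by more than `tolerance` against either reference point."""
    flagged = []
    for s in suites:
        delta_day = s.today - s.yesterday
        delta_base = s.today - s.baseline
        if delta_day < -tolerance or delta_base < -tolerance:
            flagged.append((s.name, round(delta_day, 4), round(delta_base, 4)))
    return flagged


# Hypothetical scores: retrieval fell five points overnight, summarization improved.
suites = [
    SuiteScore("retrieval", baseline=0.78, yesterday=0.84, today=0.79),
    SuiteScore("summarization", baseline=0.70, yesterday=0.72, today=0.73),
]
flagged = regressing(suites)

for name, delta_day, delta_base in flagged:
    print(f"{name}: {delta_day:+.2f} vs yesterday, {delta_base:+.2f} vs baseline")
```

Note that the retrieval suite is flagged even though it is still above its baseline; the day-over-day drop is the signal the block exists to catch.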

Block 2: open PR walkthrough (10 minutes)

The longest block. Every open PR is walked through by its author in sixty to ninety seconds. The walkthrough is not “I am working on X”; it is “this PR moves eval suite Y by Z, here is the diff, here is the failure mode it covers, here is what is blocking merge.” The reviewer for each PR is assigned in the block if not already assigned. Stale PRs (open more than 48 hours without movement) are either closed, escalated, or pared down to a smaller increment.

This block enforces the same discipline that the first-14-days engagement shape builds in week one: every PR description carries an eval delta, every PR is reviewable in under five minutes, and every PR exists for a named reason. A PR that cannot be summarized in 90 seconds against an eval delta is a PR that is too big to merge safely. The walkthrough block is the primary mechanism by which the team keeps PRs small. It is also where the tech lead spots architectural drift early; when three PRs in a row introduce subtly different abstractions for the same boundary, the standup catches it before the codebase calcifies around the inconsistency.

The block is timeboxed at ten minutes for a reason. A team with more than seven open PRs in standup is either working on too many fronts at once or has lost the discipline of merging fast. When the block runs over, the action is not to extend the block; it is to close stale PRs immediately after standup. The constraint creates the discipline.
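The two rules in this block, the 48-hour staleness threshold and the ten-minute timebox, can be checked mechanically before standup. The sketch below is an assumption-laden illustration: the PR ids and timestamps are invented, "movement" is interpreted as a last-activity timestamp, and a real version would read these from the team's code host.

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(hours=48)  # threshold from the text
BLOCK_SECONDS = 10 * 60            # the block's timebox


def stale_prs(prs, now):
    """prs: list of (pr_id, last_activity). Return ids idle for more than 48 hours."""
    return [pr_id for pr_id, last_activity in prs if now - last_activity > STALE_AFTER]


def fits_in_block(open_pr_count, seconds_each=90):
    """True if every open PR can get its walkthrough inside the ten-minute block."""
    return open_pr_count * seconds_each <= BLOCK_SECONDS


# Hypothetical PR ids and last-activity times.
now = datetime(2026, 1, 7, 9, 30)
prs = [
    ("pr-a", datetime(2026, 1, 6, 17, 0)),  # touched yesterday evening
    ("pr-b", datetime(2026, 1, 4, 11, 0)),  # idle for about 70 hours
]
stale = stale_prs(prs, now)
print("stale:", stale)
print("7 PRs at 90s each fit:", fits_in_block(7))
print("7 PRs at 60s each fit:", fits_in_block(7, seconds_each=60))
```

At ninety seconds each, six PRs fill the block; seven fit only if the walkthroughs stay near the sixty-second end of the range, which is consistent with treating more than seven open PRs as a warning sign.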

Block 3: cost spike check (5 minutes)

A graph of yesterday’s token spend, request count, and dollar cost is pulled up. The team eyeballs the line. If it is flat-to-declining, the block is over in 90 seconds. If it has spiked, the team identifies the cause, usually a runaway agent loop, a context-window blowup from a misbehaving prompt, a retry storm against a degraded provider, or a new feature shipping without a cost ceiling. The owner of the spike is named, and the remediation is committed to before the block ends.

This is the block scrum standups do not have, because traditional software does not have a cost regression failure mode in the same way. Compute costs scale with users in deterministic software; in AI systems, costs scale with the size of the prompt, the number of tool calls, the number of agentic loops, and the model selected, all of which can change in a single PR with no eval signal. A team can ship a feature that triples the spend on every request without any test catching it, because the test does not look at the bill. The standup must, or the bill will arrive at end of month with no warning.

There is a reason this block exists in week one. Anthropic’s own guidance on building agents (see the Building Effective Agents writeup) repeatedly returns to the same theme: agents are powerful and expensive, and the teams that succeed with them are the ones that instrument cost from day one. The standup is where that instrumentation gets watched. Skipping the block does not save five minutes; it defers a five-figure surprise.
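The team eyeballs the cost graph in standup, but the same judgment can be approximated with a simple threshold so that a spike is flagged before anyone looks. This is one possible sketch, not a prescribed method: comparing yesterday against the trailing mean, the 1.5x ratio, and the dollar figures are all assumptions chosen for illustration.

```python
from statistics import mean


def cost_spike(daily_cost, ratio=1.5):
    """daily_cost: dollar totals, oldest first, with yesterday last.

    Flags a spike when yesterday exceeds `ratio` times the mean of the
    preceding days. Returns False when there is no history to compare against.
    """
    *history, yesterday = daily_cost
    if not history:
        return False
    return yesterday > ratio * mean(history)


# Hypothetical daily dollar totals.
flat = [410.0, 395.0, 402.0, 399.0]
spiked = [410.0, 395.0, 402.0, 1210.0]  # roughly triple the usual spend

print("flat week spiked:", cost_spike(flat))
print("tripled day spiked:", cost_spike(spiked))
```

The same function could be run separately on token spend and request count; a spike in dollars without a spike in requests points toward prompt size, loops, or model selection rather than traffic.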

Block 4: agent and model regression triage (5 minutes)

The team checks four things in this block. First: did any model the team depends on get a silent update overnight? Anthropic, OpenAI, and Google all push behavior changes to named model versions on schedules the team does not control, and the eval suite is the early-warning system. Second: did any agent in the system loop, retry abnormally, or trip a circuit breaker in the last 24 hours? Third: are any tool integrations (search, retrieval, code execution, the customer’s own APIs) degraded or returning anomalous payloads? Fourth: is anything in the upstream provider status pages flashing yellow that the team should pre-empt?

This block is what makes the standup agent-aware in addition to eval-aware. The eval-delta block catches regressions inside the team’s code; this block catches regressions outside the team’s code. The two are different failure surfaces. A team that runs only an eval-delta block will eventually have a “we did not change anything and the system broke” outage, which is the most expensive class of incident in AI engineering because the team’s first instinct will be to bisect their own code rather than to look at the upstream model. The block reframes that instinct: the upstream is checked first, not last.

The artifact is docs/regression-log.md, a running log with a one-line entry per day noting any model behavior change, agent regression, or tool degradation observed. The log is the institutional memory the team needs when, three months in, a stakeholder asks why the system started behaving differently in week six. The answer is in the log; without the log, the answer is reconstructed forensically from chat history.
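A one-line log entry is simple enough to write by hand, but a small helper keeps the format consistent enough to search months later. The sketch below is an assumption: the text specifies only the file path and the three kinds of observation, so the line format, the `format_entry` and `append_entry` helpers, and the sample note are hypothetical.

```python
from datetime import date
from pathlib import Path

# The three kinds of observation the log is described as recording.
ALLOWED_KINDS = {"model", "agent", "tool"}


def format_entry(day, kind, note):
    """Build one log line: date, category in brackets, free-text note."""
    if kind not in ALLOWED_KINDS:
        raise ValueError(f"kind must be one of {sorted(ALLOWED_KINDS)}")
    return f"- {day.isoformat()} [{kind}] {note.strip()}"


def append_entry(log_path, entry):
    """Append the entry, creating the file and its parent directory if needed."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as f:
        f.write(entry + "\n")


# Hypothetical observation for illustration.
entry = format_entry(date(2026, 1, 7), "tool", "search API returned empty payloads for ~20 min")
print(entry)

# To write it to the path named in the text:
# append_entry(Path("docs/regression-log.md"), entry)
```

Putting the date first in ISO format means the file sorts and greps cleanly when someone needs to find what changed in a given week.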

Block 5: today’s commitment (5 minutes)

Each engineer states, in one sentence, what eval number they are trying to move today and the merge they expect to ship. Not “I am working on the retrieval refactor”; “I am pushing the retrieval refactor PR through the eval gate today, the gap is 0.04, I expect to close it.” The commitments are written down. Tomorrow’s standup opens with a glance at yesterday’s commitments before the eval block; missed commitments are noted, but not litigated in standup; they go to the post-standup follow-up.

The block produces the day’s plan, not its history. A scrum standup spends most of its time on yesterday because yesterday is how scrum measures velocity; an AI standup spends most of its time on today because the eval is what measures progress, and the eval already encodes yesterday. The asymmetry is not a stylistic choice; it is the consequence of moving from a deterministic-output discipline to a moving-target discipline.

This is also the block that sets the rhythm into the sprint-planning cadence the team runs at the higher altitude. Standups feed the sprint; the sprint is the integral of the standup commitments over two weeks. If the standup commitments are vague, the sprint will be vague. If they are eval-shaped, the sprint will be eval-shaped.
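Because the commitments are written down and glanced at the next morning, they can be recorded in a shape that makes the follow-up check trivial. The following is a minimal sketch under assumptions: the `Commitment` fields, the engineer names, the PR names, and the rule that a commitment is met only when the eval moved by the target and the PR merged are all illustrative choices, not part of the format.

```python
from dataclasses import dataclass


@dataclass
class Commitment:
    engineer: str
    eval_suite: str
    target_delta: float  # how far the engineer expects to move the number
    expected_merge: str  # the PR the engineer expects to ship


def missed(commitments, observed_delta, merged):
    """Return engineers whose suite did not move by the target or whose PR did not merge.

    observed_delta: dict of suite name -> measured change since yesterday.
    merged: set of PR names merged since yesterday.
    """
    out = []
    for c in commitments:
        moved = observed_delta.get(c.eval_suite, 0.0) >= c.target_delta
        if not (moved and c.expected_merge in merged):
            out.append(c.engineer)
    return out


# Hypothetical commitments from yesterday and what actually happened.
commitments = [
    Commitment("ana", "retrieval", 0.04, "retrieval-refactor"),
    Commitment("ben", "summarization", 0.02, "prompt-trim"),
]
observed = {"retrieval": 0.05, "summarization": 0.0}
merged_prs = {"retrieval-refactor"}

to_follow_up = missed(commitments, observed, merged_prs)
print("follow up after standup with:", to_follow_up)
```

The output is a list for the post-standup follow-up, in keeping with the rule that missed commitments are noted but not litigated in the meeting itself.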

Why this is structurally different from scrum

The scrum standup is built around three questions: what did you do yesterday, what will you do today, and what is blocking you. Those questions assume a workstream where each engineer’s work is mostly independent, mostly deterministic, and mostly visible only to them until merge. None of those assumptions hold for an AI engineering team.

An AI team’s work is not independent; every change to a prompt, a retrieval pipeline, or a tool integration interacts with every other change through the model, and the interactions are non-obvious. An AI team’s work is not deterministic; the same code runs with different model weights every day. And an AI team’s work is not invisible until merge; the eval suite is showing the team’s collective output state every hour. The scrum standup is blind to all three of those properties. The five-block standup is built for them.

The deeper claim is that “status” is the wrong unit of progress for AI engineering. The right units are eval delta, cost trajectory, and regression surface. Each of those is observable, numeric, and shared across the team. None of them are the answer to “what did you do yesterday.” The team’s work in aggregate is moving the eval; the standup is the daily place that fact gets confronted, named, and acted on.

The economics of the 30 minutes

A six-engineer agency team running a 30-minute standup costs three engineer-hours per day, or roughly 15 hours per week. That is not free, and it is the most common pushback on the format: cannot we just do a 10-minute scrum standup and look at the eval dashboard once a week? The answer is no, because the failure modes the standup is designed to catch all share one property: they compound silently between checks.

A model behavior change unnoticed for a week causes a week of code written against a moving target. A cost spike unnoticed for a week causes a five-figure bill the team will spend a week reconciling. A retrieval regression unnoticed for a week causes the eval suite to drift in ways that no single PR is responsible for, which is exactly the situation in which teams give up on evals because “they are noisy.” The 30 minutes is what keeps the noise from arriving. Skipping it does not save 30 minutes; it imports a week of debugging at a 50-fold premium.

A specific number: across the engagements I have observed in 2025 and 2026, the teams that ran the five-block standup recovered roughly two engineer-weeks of avoided rework per quarter relative to the teams that ran a scrum-style standup. That is on the order of 80 hours of engineering time recovered for 195 hours of standup time invested. The ratio is unfavorable on its face, but the recovered hours are late-stage hours (debugging hours, post-incident hours, customer-visible-regression hours), and those are the hours that destroy roadmaps. The trade is roughly two and a half dollars of cheap planning time for one dollar of expensive crisis time, and the agencies that internalize that math are the ones that ship.
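The hour figures in this section can be reproduced with straightforward arithmetic. The sketch below makes its assumptions explicit: a five-day week, thirteen working weeks per quarter, and a 40-hour engineer-week are not stated in the text, but they are the values under which the 15, 195, and 80 hour figures line up.

```python
ENGINEERS = 6
STANDUP_HOURS = 0.5            # 30 minutes per engineer per day
WORKDAYS_PER_WEEK = 5          # assumption
WEEKS_PER_QUARTER = 13         # assumption
HOURS_PER_ENGINEER_WEEK = 40   # assumption
RECOVERED_ENGINEER_WEEKS = 2   # "roughly two engineer-weeks" from the text

hours_per_day = ENGINEERS * STANDUP_HOURS
hours_per_week = hours_per_day * WORKDAYS_PER_WEEK
hours_per_quarter = hours_per_week * WEEKS_PER_QUARTER
recovered_hours = RECOVERED_ENGINEER_WEEKS * HOURS_PER_ENGINEER_WEEK

# Standup hours invested for each hour of rework avoided.
invested_per_recovered = hours_per_quarter / recovered_hours

print(f"standup hours per day:     {hours_per_day}")
print(f"standup hours per week:    {hours_per_week}")
print(f"standup hours per quarter: {hours_per_quarter}")
print(f"recovered hours:           {recovered_hours}")
print(f"invested per recovered:    {invested_per_recovered:.2f}")
```

On these assumptions the team invests a little under two and a half standup hours for each hour of rework avoided, so the case for the format rests on the recovered hours being far more expensive than the invested ones.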

The cadence into the rest of the week

Standup is the daily ritual; it does not stand alone. The artifacts standup produces (the regression log, the day’s commitments, the eval-delta annotations on each PR) feed the weekly review on Friday and the biweekly retro at sprint end. The weekly review is where the standup decisions are looked at in aggregate: which evals are improving, which costs are trending, which regressions repeated. The retro is where the standup format itself gets adjusted: a block dropped, a block added, a timer tightened.

The format is not sacred. What is sacred is that the team confronts eval delta, cost, and regression every single day, and that the confrontation is structured, timeboxed, and observable in artifacts after the meeting. A standup that does not produce artifacts is a meeting; a standup that produces artifacts every day for six months is a culture. The 30 minutes per day is how the culture gets built.

Arthur Wandzel is the founder of SFAI Labs, a forward-deployed AI development agency in San Francisco. He has run the five-block standup format across more than two dozen client engagements over the last 18 months.

Frequently Asked Questions

Why is an AI agency standup 30 minutes instead of the standard 10 to 15?

Because the things an AI engineering team must check before writing code today cannot be checked in a 10-minute round-robin status update. A scrum standup confirms that engineers are busy; an AI standup confirms that the eval is moving, the cost trajectory is healthy, and no upstream model or agent regressed overnight. Each of those checks needs five minutes minimum to be more than performative. Skipping the extra 20 minutes does not save 20 minutes; it defers a week of compounding regressions that surface as multi-day debugging sessions later.

What are the five blocks of the AI agency standup format?

Block one is a five-minute eval delta review with the dashboard projected. Block two is a ten-minute walkthrough of every open PR by its author, with eval delta named in each. Block three is a five-minute cost spike check against yesterday’s token spend, request count, and dollar cost. Block four is a five-minute agent and model regression triage covering silent model updates, agent loops, tool degradation, and provider status. Block five is a five-minute daily commitment from each engineer naming the eval number they will move and the merge they will ship today.

What does eval-first mean in the context of a standup?

Eval-first means the standup opens with the eval dashboard, not with status updates. The first question the team confronts every day is whether each eval suite is improving, flat, or regressing relative to yesterday and the baseline. Status updates and PR walkthroughs come after. The reason for the ordering is that eval regressions compound silently between checks, while status updates do not, so the highest-value question goes first while the room is still attentive. A standup that asks ‘what did you do yesterday’ before ‘is the system better today’ is structurally blind to the only metric that matters.

Why does an AI standup include a cost spike check that scrum standups do not?

Because AI systems have a cost regression failure mode that traditional software does not. A single PR can triple the per-request token spend without any test catching it, because tests do not look at the bill. Causes include runaway agent loops, context window blowups, retry storms against degraded providers, and new features shipping without a cost ceiling. The five-minute block surfaces the spike inside 24 hours rather than at month-end invoice. Without the block, the team imports a five-figure surprise that takes a week to reconcile and erodes the engagement’s economic credibility.

What is the agent and model regression triage block checking?

Four things. First, whether any model the team depends on received a silent behavior update overnight, since Anthropic, OpenAI, and Google push updates to named versions on schedules outside the team’s control. Second, whether any agent in the system looped abnormally, retried excessively, or tripped a circuit breaker in the last 24 hours. Third, whether tool integrations like search, retrieval, code execution, or customer APIs are returning anomalous payloads. Fourth, whether upstream provider status pages show degradation worth pre-empting. The output is a one-line entry per day in a regression log that becomes institutional memory.

How is the AI agency standup structurally different from a scrum standup?

Scrum is built around three questions: what did you do yesterday, what will you do today, what is blocking you. Those questions assume work that is mostly independent, mostly deterministic, and mostly invisible until merge. None of those hold for AI engineering. AI work interacts non-obviously through the model, runs against weights that change daily, and is continuously visible through the eval suite. The AI standup substitutes eval delta, cost trajectory, and regression surface for status as the unit of progress, and it reorders the day around the team’s collective output state rather than each engineer’s individual yesterday.

What is the economic case for spending 30 minutes per day on standup?

A six-engineer team running the format invests roughly 195 standup-hours per quarter. Across observed engagements in 2025 and 2026, teams running the five-block format recovered approximately 80 hours of avoided rework per quarter relative to scrum-style teams. The ratio looks unfavorable on its face, but the recovered hours are late-stage hours: post-incident debugging, customer-visible regression cleanup, and roadmap-destroying crisis work. The trade is roughly two and a half dollars of cheap planning time for one dollar of expensive crisis time. Agencies that internalize the math ship; agencies that skip the standup pay for the same time at a much higher rate later.

What is the cost of catching a regression in standup versus a week later?

A regression caught in the eval delta block costs about one day of rework: the responsible PR is reverted, the cause is investigated, the eval case is sharpened, and the fix lands the next day. A regression caught a week later costs about three weeks: the failure has compounded across five subsequent PRs, the eval was never tightened, the customer-visible behavior has drifted, and the team must bisect through an entire sprint of changes. The five-minute block is therefore worth on the order of fifteen days of avoided rework per quarter, which dominates the entire standup time budget.

How does the standup feed sprint planning and the weekly review?

The standup is the daily ritual that feeds two higher-altitude rituals. The weekly review on Friday looks at standup decisions in aggregate: which evals improved, which costs trended, which regressions repeated. The biweekly retro at sprint end is where the standup format itself gets adjusted, with blocks dropped, added, or retimed based on what the team used. The artifacts the standup produces, including the regression log, the day’s commitments, and PR eval-delta annotations, are the inputs to those reviews. Without them the higher rituals run on memory; with them they run on evidence.

What artifacts does the standup produce, and where do they live?

Three concrete artifacts. The regression log lives at docs/regression-log.md and contains a one-line entry per day noting model behavior changes, agent regressions, and tool degradations. The day’s commitments are recorded as a short section in the team’s daily channel or repo file, capturing each engineer’s named eval target and expected merge. The eval-delta annotations land in PR descriptions during the walkthrough block. A standup that produces no artifacts is a meeting; a standup that produces artifacts every day for six months is the substrate of a healthy AI engineering culture and the only mechanism that lets a stakeholder reconstruct, in month four, why the system started behaving differently in week six.
