A quarterly business review for an AI project is not a quarterly business review for a software project with the word “AI” added. The questions are different, the evidence is different, the decisions are different, and the metrics that move the funding decision are not the metrics most QBR decks lead with. This piece names the eleven metrics that belong on an AI project QBR, why each one matters, what a healthy reading looks like, and what a flag looks like. It is short, opinionated, and ready to copy into the next 90-day review deck.
It is a spoke of the AI project economics manifesto, which establishes evaluation as the unit of account. The QBR is the artifact that surfaces eval data, unit cost trajectory, and gate-position evidence to the people approving the next quarter of spend.
Why the legacy QBR template fails AI work
The legacy software QBR is built around four headline metrics: features shipped, sprint velocity, defect rate, sentiment. They are reasonable proxies for whether a CRUD project is moving and whether the team is healthy. They are misleading proxies for whether an AI project is moving.
The reason: AI value is produced by eval correctness against the workload at a defensible unit cost, not by feature throughput. A team can ship every story on the roadmap on time, on budget, with low defect rate and high sentiment, and still produce a system that fails the production eval at unit costs that make the unit economics impossible. The four legacy metrics will all read green. The funding decision based on those four metrics will be wrong.
The fix is not to add more metrics to the legacy template. It is to replace the headline metrics with metrics that measure the actual unit of work: eval correctness, unit cost, regression rate, retainer health, gate position. The eleven metrics below are that replacement. Eight of them will be unfamiliar to a team running 2018 QBR templates. All eight should be unfamiliar; their absence is the source of the budget surprises we documented in the hidden cost of AI evals piece.
The 11 metrics
Each metric below names what it measures, why it matters, what a healthy reading looks like, and what triggers a flag.
1. Eval-set weighted score (primary production eval)
What it measures. The weighted score on the named production eval set at its current locked version, against the locked threshold.
Why it matters. This is the headline measurement of whether the system is working. Without it, the rest of the QBR is theater.
Healthy. At or above threshold, with a positive delta of any size since last quarter.
Flag. Below threshold, or above threshold but with a negative delta since last quarter (regression).
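A minimal sketch of how the weighted score and its reading might be computed. The `EvalCase` shape, the 0-100 scale, and the function names are illustrative assumptions, not a prescribed schema; the text does not say how a flat delta at or above threshold should be read, so the sketch leaves that case unclassified.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class EvalCase:
    """One case in the eval set (hypothetical shape)."""
    weight: float  # relative importance of this case
    passed: bool   # did the system pass it at the locked version?


def weighted_score(cases: Iterable[EvalCase]) -> float:
    """Weighted pass rate on a 0-100 scale."""
    cases = list(cases)
    total = sum(c.weight for c in cases)
    if total <= 0:
        raise ValueError("eval set needs positive total weight")
    return 100.0 * sum(c.weight for c in cases if c.passed) / total


def read_eval_score(score: float, threshold: float, last_quarter: float) -> str:
    """Apply the healthy/flag reading for metric 1."""
    if score < threshold:
        return "flag: below threshold"
    if score < last_quarter:
        return "flag: regression since last quarter"
    if score > last_quarter:
        return "healthy"
    # At or above threshold with a zero delta: the text does not classify this.
    return "unclassified: flat at or above threshold"


cases = [EvalCase(weight=3.0, passed=True), EvalCase(weight=1.0, passed=False)]
score = weighted_score(cases)  # 75.0
print(read_eval_score(score, threshold=70.0, last_quarter=72.0))  # healthy
```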
2. Eval delta over the quarter
What it measures. Change in the weighted score from start of quarter to end of quarter, against the same eval set version (or noted if the version changed).
Why it matters. A high absolute eval score that is not improving is a project that has plateaued; a moderate eval score that is climbing 4 to 6 points per quarter is a project that is investing in improvement. The trajectory is the signal.
Healthy. A positive delta on a defended eval set, with the delta drivers named (model swap, prompt change, retrieval tune).
Flag. A flat or negative delta, or a positive delta produced only by changing the eval set in ways that lower the bar.
3. Cost-per-canonical-unit
What it measures. Cost per canonical workload unit (per-completion, per-action, or per-resolved-ticket), depending on which is the production-relevant unit.
Why it matters. Eval score in isolation is a quality measurement; cost per unit is the other half of the unit economics. The two are co-determined; an eval gain at twice the cost is not a gain.
Healthy. Falling or flat with a clear forecast of falling.
Flag. Rising, or flat at a level that breaks the unit economics of the product the AI is embedded in.
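As a rough illustration, the metric is total cost for the period divided by canonical units delivered in that period. Which cost lines belong in the numerator is a judgment call the team has to make and defend; the lines and figures below are hypothetical.

```python
def cost_per_canonical_unit(cost_lines: dict[str, float], units: int) -> float:
    """Total cost for the period divided by canonical units delivered."""
    if units <= 0:
        raise ValueError("units must be positive")
    return sum(cost_lines.values()) / units


# Hypothetical quarter: the cost lines and the unit (resolved tickets) are
# illustrative assumptions, not figures from this piece.
quarter_costs = {"inference": 9000.0, "retrieval": 2000.0, "human_review": 1000.0}
unit_cost = cost_per_canonical_unit(quarter_costs, units=40_000)
print(f"{unit_cost:.2f} per resolved ticket")  # 0.30 per resolved ticket
```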
4. Cost-per-canonical-unit trajectory (4 quarters)
What it measures. The same metric over four reporting periods, plotted as a line.
Why it matters. A snapshot can be misleading; the trend is what governs forecasts and what the board has to act on. A four-quarter trajectory is enough to see whether the project is bending the curve.
Healthy. Monotonically falling, or flat with a clear bending event (planned model swap, retrieval restructure) on the next quarter.
Flag. Two consecutive quarters of rising cost-per-unit with no named driver.
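One way to encode the reading, as a sketch. It assumes "two consecutive quarters of rising" means two back-to-back quarter-over-quarter increases anywhere in the four-quarter window, and it returns "review" for shapes the text does not classify. The function and parameter names are hypothetical.

```python
from typing import Sequence


def read_cost_trajectory(
    costs: Sequence[float],
    named_driver: bool = False,
    planned_bending_event: bool = False,
) -> str:
    """Read four quarters of cost-per-unit, oldest first."""
    if len(costs) != 4:
        raise ValueError("expected exactly four quarters")
    deltas = [later - earlier for earlier, later in zip(costs, costs[1:])]

    # Flag: two back-to-back quarter-over-quarter increases with no named driver.
    two_rising = any(a > 0 and b > 0 for a, b in zip(deltas, deltas[1:]))
    if two_rising and not named_driver:
        return "flag"
    if all(d < 0 for d in deltas):
        return "healthy"  # monotonically falling
    if all(d == 0 for d in deltas) and planned_bending_event:
        return "healthy"  # flat, with a bending event planned next quarter
    return "review"  # not classified by the text; needs a human read


print(read_cost_trajectory([0.40, 0.36, 0.33, 0.30]))  # healthy
print(read_cost_trajectory([0.30, 0.32, 0.35, 0.35]))  # flag
```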
5. Regression rate at the most recent model upgrade
What it measures. Percentage of the production eval set that regressed when the most recent frontier model upgrade was rolled in (for example, Claude 4.7 -> 4.8, GPT-5 -> 5.1, Gemini 4 -> 4.1).
Why it matters. Regression rate is the proxy for how much fragility the system carries against an external event. A low regression rate (< 5 percent) means the prompts, retrieval, and tools are robust to model swaps; a high regression rate (> 15 percent) means the system is brittle to events it cannot control.
Healthy. Below 10 percent on a stable threshold.
Flag. Above 15 percent, or any rate trending upward across upgrades.
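A sketch of the calculation, assuming a case counts as regressed when it passed before the upgrade and fails after it, measured against the whole eval set at the same version. The 10 to 15 percent band is not classified in the text, so the sketch labels it "watch"; all names are hypothetical.

```python
def regression_rate(before: dict[str, bool], after: dict[str, bool]) -> float:
    """Percent of the eval set that passed before the upgrade and fails after."""
    if not before or before.keys() != after.keys():
        raise ValueError("both runs must cover the same non-empty eval set")
    regressed = sum(1 for case_id, ok in before.items() if ok and not after[case_id])
    return 100.0 * regressed / len(before)


def read_regression_rate(rate: float, trending_up: bool = False) -> str:
    """Apply the healthy/flag reading for metric 5."""
    if rate > 15.0 or trending_up:
        return "flag"
    if rate < 10.0:
        return "healthy"
    return "watch"  # 10 to 15 percent: between the two stated bands


before = {f"case-{i}": True for i in range(20)}
after = {**before, "case-3": False}  # one case regresses after the upgrade
rate = regression_rate(before, after)  # 5.0
print(rate, read_regression_rate(rate))
```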
6. Triage time on the most recent regression
What it measures. Engineering days from regression detection to regression closed (eval threshold re-locked).
Why it matters. This measures the operational maturity of the eval-and-triage discipline. A team that triages a 9 percent regression in 7 engineering days has the muscle; a team that takes 35 days does not.
Healthy. Within the contracted SLA (typically 14 to 21 engineering days for a non-critical regression).
Flag. Outside SLA, or no SLA in place.
7. Eval suite freshness
What it measures. Time since the production eval set was last refreshed with new representative inputs from production traffic.
Why it matters. Eval sets drift. A test set that was representative six months ago may not be representative of current customer behavior. A stale eval set produces a high score that does not predict production performance.
Healthy. Refreshed within the last 90 days, with at least 15 percent of the set rotated each refresh.
Flag. Older than 180 days, or no refresh cadence.
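The freshness reading reduces to a date difference and a rotation fraction. A minimal sketch follows; the function name and the "watch" label for the unclassified middle ground (for example, 91 to 180 days old) are assumptions.

```python
from datetime import date
from typing import Optional


def read_freshness(
    last_refresh: Optional[date],
    today: date,
    rotated_fraction: float,
    has_cadence: bool = True,
) -> str:
    """Apply the healthy/flag reading for metric 7."""
    if last_refresh is None or not has_cadence:
        return "flag"  # never refreshed, or no refresh cadence
    age_days = (today - last_refresh).days
    if age_days > 180:
        return "flag"
    if age_days <= 90 and rotated_fraction >= 0.15:
        return "healthy"
    return "watch"  # between the two stated bands


print(read_freshness(date(2025, 1, 1), date(2025, 3, 1), rotated_fraction=0.20))  # healthy
print(read_freshness(date(2024, 6, 1), date(2025, 3, 1), rotated_fraction=0.20))  # flag
```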
8. Production traffic share at threshold
What it measures. Percentage of production traffic the AI system is handling at the locked eval threshold (versus shadow mode, A/B holdout, or human-in-the-loop fallback).
Why it matters. A system that passes eval but only handles 5 percent of production traffic is in pilot, not in production. The QBR has to surface the gap between eval and traffic share.
Healthy. Approaching the planned production share (typically 60 to 95 percent depending on use case sensitivity), with the rest in deliberate shadow or HITL fallback.
Flag. Eval passing but traffic share below 30 percent with no roadmap to ramp.
9. Maintenance retainer SLA performance
What it measures. Whether the named retainer SLAs (eval re-runs weekly, regressions triaged in N days, model-upgrade re-evals in N weeks) are being met.
Why it matters. The retainer is the post-launch cost line that determines whether the project’s eval bar holds. A retainer that is paid but not performing is the most expensive form of unmonitored cost.
Healthy. All named SLAs hit in the quarter, with the retainer billed against actual work performed.
Flag. Any SLA missed without remediation, or no SLA defined.
10. Quarterly milestone variance
What it measures. Variance between the quarter’s named eval-threshold milestone and what was delivered (in eval points and unit cost).
Why it matters. This is the metric that connects the QBR to the funding decision. A quarter that hit its milestone earns the next quarter’s funding mechanically; a quarter that missed triggers a structured restart conversation. We argue the cadence in the quarterly funding piece.
Healthy. Milestone hit on threshold and within unit cost ceiling.
Flag. Milestone missed on either dimension, or milestone redefined mid-quarter to absorb the miss.
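Milestone variance can be reported as two signed numbers, one per dimension. The sketch below assumes the milestone is stated as an eval threshold plus a unit cost ceiling; the `Milestone` shape and the example figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Milestone:
    """The quarter's named milestone (hypothetical shape)."""
    eval_threshold: float     # eval points the quarter committed to
    unit_cost_ceiling: float  # maximum acceptable cost per canonical unit


def milestone_variance(m: Milestone, score: float, unit_cost: float) -> tuple[float, float]:
    """Return (eval points over threshold, unit cost over ceiling)."""
    return score - m.eval_threshold, unit_cost - m.unit_cost_ceiling


def milestone_hit(m: Milestone, score: float, unit_cost: float) -> bool:
    """Hit means at or above threshold AND at or under the cost ceiling."""
    eval_var, cost_var = milestone_variance(m, score, unit_cost)
    return eval_var >= 0 and cost_var <= 0


q = Milestone(eval_threshold=85.0, unit_cost_ceiling=0.50)
print(milestone_variance(q, score=87.0, unit_cost=0.25))  # (2.0, -0.25)
print(milestone_hit(q, score=87.0, unit_cost=0.25))       # True
```

A miss on either dimension fails the check, which matches the flag condition above.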
11. Stage-gate position and named next-gate kill criterion
What it measures. Where the project sits in the 90-day / 12-month / 24-month gate sequence, and the named criterion that would kill the project at the next gate.
Why it matters. This is the metric that prevents the QBR from drifting into “we’ll figure it out next quarter” mode. The named kill criterion forces the team to own what failure would look like, in writing, before the failure shows up. We unpack the gate sequence in the payback paradox piece.
Healthy. Gate position named, kill criterion named in eval-points and unit-cost terms, evidence the gate is on track.
Flag. Vague gate position (“we’re making progress”), no kill criterion, or kill criterion stated in non-measurable terms.
What to leave off the QBR
The eleven metrics above are an inclusion list and an implicit exclusion list. Five categories of metric belong in the appendix at most, not on the headline page.
Story points and sprint velocity. Process throughput, not product correctness. A team can burn velocity every quarter and produce a system that misses the eval threshold.
Feature counts. Activity, not progress. We argue the case in detail in the features-shipped piece.
Model benchmark scores in isolation. A model’s MMLU or GPQA score is interesting context for which model is being used. It is not a measurement of whether your specific workload passes its eval set. Production eval beats benchmark eval every time.
Customer-reported sentiment as a primary metric. Sentiment is a lagging indicator that is uninterpretable without an eval anchor. A QBR that leads with NPS or sentiment is a QBR that is hiding the eval data.
Token cost in isolation. Token cost is a sub-metric of cost-per-canonical-unit. Reporting raw monthly token cost without normalizing to the workload unit produces a metric that fluctuates with traffic mix and is impossible to act on.
The discipline of leaving these off is the discipline of admitting the QBR is a funding decision, not a status report.
How to read the metrics together
The eleven metrics are not eleven independent signals. They are four clusters that have to be read in combination.
The correctness cluster (1, 2, 7, 8). Eval score, eval delta, eval freshness, traffic share at threshold. These four read together answer “is the system working at production scale on a current test set.” Any single one in isolation is misleading; the four together are robust.
The economics cluster (3, 4, 9). Cost per unit, four-quarter trajectory, retainer SLA. These three answer “is the project on a defensible unit-economics path.” A rising eval score on a worsening cost trajectory is a flag the correctness cluster alone will not surface.
The fragility cluster (5, 6). Regression rate at last upgrade, triage time. These two answer “how brittle is the system to events it does not control.” A team that ignores the fragility cluster discovers it during the next frontier model release.
The governance cluster (10, 11). Milestone variance, stage-gate position with named kill criterion. These two answer “is the project earning the next 90 days of funding, and what would failure look like.” This cluster is the one that connects the QBR to the funding decision.
A QBR that surfaces all four clusters in one hour, with named numbers and named criteria, produces a real funding decision. A QBR that surfaces only the correctness cluster (because it is the easiest one to draft) produces a status update masquerading as governance.
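The cluster reading can be sketched as a roll-up over per-metric statuses. The cluster membership comes from the grouping above; the status strings and the "not surfaced" label for a cluster with missing metrics are illustrative assumptions.

```python
# Metric numbers per cluster, as grouped in this piece.
CLUSTERS = {
    "correctness": (1, 2, 7, 8),
    "economics": (3, 4, 9),
    "fragility": (5, 6),
    "governance": (10, 11),
}


def read_clusters(statuses: dict[int, str]) -> dict[str, str]:
    """Roll per-metric readings ('healthy' or 'flag') up to cluster level."""
    result = {}
    for name, metric_ids in CLUSTERS.items():
        if any(m not in statuses for m in metric_ids):
            result[name] = "not surfaced"  # a metric in the cluster is missing
        elif any(statuses[m] == "flag" for m in metric_ids):
            result[name] = "flag"
        else:
            result[name] = "healthy"
    return result


# A deck that only reports the correctness cluster.
correctness_only = {1: "healthy", 2: "healthy", 7: "healthy", 8: "healthy"}
print(read_clusters(correctness_only))
```

A deck that reports only the correctness metrics shows the other three clusters as not surfaced, which is the status-update failure mode described above.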
Frequently asked questions
What is an AI project quarterly review?
A 90-day governance review that surfaces the eval correctness, unit cost, fragility, and governance evidence a sponsor needs to approve the next quarter of funding. It replaces the legacy software QBR template, which leads with features shipped and sprint velocity, both misleading proxies for AI project health.
Why are sprint velocity and feature counts misleading for AI projects?
Because AI value is produced by eval correctness against production workload at a defensible unit cost, not by feature throughput. A team can ship every story on time and produce a system that fails the eval at unit costs that break the unit economics. Velocity and feature counts will read green; the funding decision based on them will be wrong.
What are the most important AI QBR metrics?
The eleven we list: eval weighted score, eval delta, cost-per-canonical-unit, cost trajectory over four quarters, regression rate at last model upgrade, triage time, eval suite freshness, production traffic share at threshold, retainer SLA performance, quarterly milestone variance, stage-gate position with named kill criterion. They cluster into correctness, economics, fragility, and governance.
What is a healthy regression rate at a model upgrade?
Below 10 percent on a stable threshold, with the regression triaged within the contracted SLA (typically 14 to 21 engineering days for a non-critical regression). Above 15 percent, or any rate trending upward across upgrades, is a flag that the system is brittle to events it does not control.
What is eval suite freshness?
Time since the production eval set was last refreshed with new representative inputs from production traffic. Healthy is refreshed within the last 90 days with at least 15 percent of the set rotated. Stale eval sets produce high scores that do not predict production performance.
How does production traffic share matter?
A system that passes its eval but only handles 5 percent of production traffic is in pilot, not in production. The QBR has to surface the gap between eval pass and traffic share. Healthy is approaching the planned production share with the remainder in deliberate shadow or human-in-the-loop fallback. Eval passing with traffic share below 30 percent and no roadmap to ramp is a flag.
What is the named next-gate kill criterion?
The criterion, stated in eval points and unit cost terms, that would kill the project at the next 90-day, 12-month, or 24-month gate. It forces the team to own what failure would look like in writing before the failure shows up. Vague criteria (“we’ll see how it goes”) are not criteria.
How do the metrics cluster together?
Four clusters: correctness (1, 2, 7, 8), economics (3, 4, 9), fragility (5, 6), governance (10, 11). Reading any single cluster in isolation produces a misleading picture; reading all four together produces the full signal a sponsor needs to make a real funding decision.
What should not be on the QBR?
Story points, sprint velocity, raw feature counts, isolated model benchmark scores, customer sentiment without an eval anchor, raw token cost without normalization to the workload unit. These belong in the appendix at most. The discipline of leaving them off is the discipline of admitting the QBR is a funding decision, not a status report.
How long should an AI QBR take?
One hour, with all four clusters surfaced. The eleven metrics fit in eleven slides or one tightly written four-page memo. A QBR that runs ninety minutes is usually one that is hiding the eval data inside narrative.
Key takeaways
- A QBR for an AI project replaces sprint velocity and features shipped with eval delta, unit cost trajectory, regression rate, and gate position.
- The eleven metrics cluster into correctness (eval score, delta, freshness, traffic share), economics (cost per unit, trajectory, retainer SLA), fragility (regression rate, triage time), and governance (milestone variance, gate position with kill criterion).
- Eval score in isolation is a quality measurement; cost per unit is the other half of the unit economics. They are co-determined and must be read together.
- Regression rate at the most recent model upgrade is a proxy for how brittle the system is to external events. Below 10 percent is healthy; above 15 percent is a flag.
- Eval suite freshness (refreshed within 90 days with 15 percent rotation) prevents the high-score-on-stale-test problem.
- Stage-gate position must include a named kill criterion in eval-points and unit-cost terms. Vague criteria are not criteria.
- Story points, raw feature counts, isolated benchmark scores, sentiment without eval anchor, and raw token cost belong in the appendix at most.
- The QBR is a funding decision, not a status update. One hour, eleven metrics, four clusters, named asks.
The right metrics do not just describe the project. They determine which 90 days get funded and which do not, and they do it on evidence the team and the sponsor can both audit in the same room.
Arthur Wandzel