Stop budgeting AI projects in story points. Budget them in eval runs.

Story points are a 2007-era scrum tool for measuring relative effort across a team’s implementation hours. They were never designed for work whose bottleneck is a non-deterministic system passing or failing an eval at a target threshold. Estimating a 13-point AI feature is estimating in the wrong unit, and the resulting variance (week-three sprints that bleed into week six because the eval score did not move) is a structural artifact of that unit, not a failure of the team to estimate harder. The replacement is to budget AI work in eval runs.

This piece is prescriptive. The 13-point feature becomes “8 eval runs at 70 percent threshold.” The sprint becomes a budget of eval-run capacity. Velocity becomes eval-runs-per-sprint, not points-per-sprint. The argument extends the AI project economics manifesto’s principle that evaluation is the unit of account into agile sprint planning, where the rest of the engineering org runs.

Why story points break for AI work

Three assumptions break for AI features.

Effort dominates uncertainty. A 5-point CRUD feature is roughly twice the work of a 3-point one because both are mostly implementation. AI features invert this: implementation hours for a RAG feature might be small, while the eval-convergence cycles might take 5× as long. Story points compress two unrelated dimensions into one indefensible number.

The bottleneck is implementation throughput. Story points assume the constraint is how much the team can implement per sprint. AI work bottlenecks on eval-pass cycles: try a prompt, run the eval, see the score, refine, run again. The constraint is how fast the suite grades and how many cycles it takes to converge. A team shipping 30 CRUD points per sprint cannot ship 30 AI points per sprint.

Completion is binary. User stories complete when acceptance criteria are met. AI features have a continuous completion criterion: eval score against threshold. A feature at 0.78 against a 0.85 target might need three more cycles or eight, depending on its failure modes. Story points cannot represent partial completion; eval runs can.

The cumulative effect: AI sprints estimated in story points are structurally over-committed. The team estimates implementation hours, the eval cycle takes longer, the sprint slips, retrospectives blame “underestimation,” and the team adjusts points upward in a feedback loop that rarely converges because the unit is wrong.

What an eval run is

One eval run is one execution of the named eval suite against the current state of the system, producing a score against the threshold. The scoring is mechanical: pass or fail, with the score logged. The work that goes into an eval run includes the prompt or retrieval changes since the previous run, the time to run the suite, and the time to interpret the results.

A typical eval run in 2026 takes 30 to 90 minutes wall-clock, with 1 to 4 engineering hours spent reviewing the results. A run that fails its threshold consumes the same time as one that passes; both are units of progress.

Three properties make eval runs the right unit for AI work.

Eval runs are non-arbitrary. A run is logged with a run ID, a score, a timestamp, and a diff from the previous run. There is no judgment about “how big” it was; the artifact is concrete. Story points require the team to vote on size; eval runs do not.

Eval runs map to the bottleneck. The work between runs is the implementation work. The runs themselves are the cycles that determine when the work is done. Counting runs measures the actual constraint on AI feature progress, not a proxy for it.

Eval runs accumulate as evidence. The history of eval runs for a feature is the project’s audit trail of how the feature converged on its threshold. Story-point histories are folklore. Eval-run histories are data. See “The AI agency quality system: evals, observability and weekly review” for how mature teams operationalize the run history.
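A minimal sketch of what one logged run and its history could look like, assuming a Python codebase. The class name `EvalRun`, its field names, and the sample values are illustrative assumptions, not a schema from any particular eval framework.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EvalRun:
    """One execution of the eval suite. Field names are hypothetical."""
    run_id: str
    feature: str
    score: float         # suite score in [0, 1]
    threshold: float     # target the feature must reach
    timestamp: datetime
    diff_ref: str        # pointer to the change since the previous run, e.g. a commit hash

    @property
    def passed(self) -> bool:
        # Scoring is mechanical: a run passes when the score meets the threshold.
        return self.score >= self.threshold


def runs_used(history: list[EvalRun], feature: str) -> int:
    """Count every run for a feature; failed runs are units of progress too."""
    return sum(1 for run in history if run.feature == feature)


# Made-up sample history for illustration only.
history = [
    EvalRun("run-001", "support-agent", 0.71, 0.85,
            datetime(2026, 6, 1, tzinfo=timezone.utc), "abc123"),
    EvalRun("run-002", "support-agent", 0.78, 0.85,
            datetime(2026, 6, 2, tzinfo=timezone.utc), "def456"),
]
print(runs_used(history, "support-agent"), history[-1].passed)  # 2 False
```

Nothing in the record asks anyone to vote on size; the count of runs falls out of the log.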

How to estimate in eval runs

Three estimation approaches.

Reference-class forecasting. A team that has shipped six AI features knows roughly how many runs each took to converge: faithfulness thresholds typically take 4 to 12 runs, latency optimization 3 to 8, cost-per-call 5 to 15. New features get estimated against the most similar past feature, with a multiplier for novelty.
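As a rough sketch, reference-class forecasting reduces to averaging the run counts of the most similar past features and scaling by a novelty multiplier. The function name, the sample run counts, and the 1.25 multiplier are assumptions for illustration; rounding up to a whole run is also a choice made here, not something the method requires.

```python
import math


def reference_class_estimate(similar_feature_runs: list[int],
                             novelty_multiplier: float = 1.0) -> int:
    """Estimate runs for a new feature from similar past features.

    similar_feature_runs: runs-to-converge for the closest past features.
    novelty_multiplier: >1.0 when the new feature is less familiar (hypothetical knob).
    """
    if not similar_feature_runs:
        raise ValueError("need at least one past feature as a reference class")
    average = sum(similar_feature_runs) / len(similar_feature_runs)
    # Runs are whole units, so round the estimate up.
    return math.ceil(average * novelty_multiplier)


# Two past agents that took 8 and 10 runs (made-up numbers averaging 9).
print(reference_class_estimate([8, 10]))        # 9
print(reference_class_estimate([8, 10], 1.25))  # 12
```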

Threshold-difficulty heuristic. 70 percent threshold ≈ 3 to 6 runs. 85 percent ≈ 6 to 12. 92 percent ≈ 12 to 25. The relationship is non-linear because each marginal point above 80 percent typically requires a structural change (better retrieval, better prompts, sometimes a different model) rather than a refinement.
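The heuristic can be kept as a small lookup table. This sketch uses only the three bands quoted above; mapping an in-between threshold up to the next listed band is an assumption made here, and the function deliberately refuses to guess above 92 percent.

```python
# (threshold, (low_runs, high_runs)) bands from the heuristic above.
THRESHOLD_BANDS = [
    (0.70, (3, 6)),
    (0.85, (6, 12)),
    (0.92, (12, 25)),
]


def heuristic_run_range(threshold: float) -> tuple[int, int]:
    """Return the run range of the first band at or above the target threshold."""
    for band_threshold, run_range in THRESHOLD_BANDS:
        if threshold <= band_threshold:
            return run_range
    raise ValueError("no heuristic band is given for thresholds above 0.92")


print(heuristic_run_range(0.85))  # (6, 12)
print(heuristic_run_range(0.80))  # (6, 12), rounded up to the 85 percent band
```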

Failure-mode taxonomy. A feature whose failure modes cluster (one retrieval bug producing 30 percent of misses) converges fast, in 3 to 6 runs. Scattered failure modes (10 small reasons at 3 percent each) converge slowly, in 10 to 20 runs. The taxonomy is built during the first 2 runs; estimates sharpen after run 2.
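One way to make the taxonomy mechanical is to label each failed prompt with a failure mode and check how concentrated the labels are. The sketch below is an assumption-heavy illustration: the labels are invented, and the 0.30 cutoff is borrowed from the "one bug producing 30 percent of misses" example rather than from any validated rule.

```python
from collections import Counter


def classify_failures(labels: list[str],
                      cluster_share: float = 0.30) -> tuple[str, tuple[int, int]]:
    """Classify failure modes as clustered or scattered and return a run range.

    labels: one failure-mode label per failed prompt (hypothetical labelling).
    cluster_share: share of misses the top mode must reach to count as clustered.
    """
    if not labels:
        raise ValueError("no failed prompts to classify")
    _, top_count = Counter(labels).most_common(1)[0]
    top_share = top_count / len(labels)
    if top_share >= cluster_share:
        return "clustered", (3, 6)
    return "scattered", (10, 20)


# 9 of 20 misses share one cause: clustered.
clustered = ["retrieval_bug"] * 9 + ["tone", "format", "refusal"] * 3 + ["other"] * 2
# Ten causes at two misses each: scattered.
scattered = [f"cause_{i}" for i in range(10)] * 2

print(classify_failures(clustered))  # ('clustered', (3, 6))
print(classify_failures(scattered))  # ('scattered', (10, 20))
```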

Worked example. A team ships a support agent at 85 percent faithfulness on a 240-prompt eval set. Reference class: two similar agents averaged 9 runs. Heuristic: 85 percent is 6 to 12 runs. Budget: 10 runs. At a combined sprint capacity of 4 runs per week across 2 engineers, that is 10 ÷ 4 = 2.5 weeks. The conversation is “10 eval runs across 2.5 weeks” rather than “8 story points across 1 sprint.”
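The arithmetic of the worked example, written out. This reads the 4 runs per week as the combined capacity of both engineers, which is the reading that yields the stated 2.5 weeks; the variable names are illustrative.

```python
reference_class_runs = 9                 # two similar agents averaged 9 runs
heuristic_low, heuristic_high = 6, 12    # band for an 85 percent threshold
budget_runs = 10                         # chosen budget

# The budget sits above the reference class and inside the heuristic band.
budget_is_consistent = (reference_class_runs <= budget_runs
                        and heuristic_low <= budget_runs <= heuristic_high)

team_runs_per_week = 4                   # combined across both engineers (assumed reading)
weeks_needed = budget_runs / team_runs_per_week

print(budget_is_consistent, weeks_needed)  # True 2.5
```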

Sprint planning with eval-run budgets

A sprint planned in eval runs looks structurally different from a sprint planned in story points.

The sprint goal is not “ship features X, Y, Z.” The goal is “feature X reaches 85 percent eval threshold (budget: 8 runs); feature Y reaches 75 percent (budget: 5 runs); feature Z’s eval set v1 is locked (budget: 3 runs).” The total run budget for the sprint is the team’s run capacity, typically 12 to 20 runs per sprint depending on suite size and team size.

Run capacity is finite and visible. A sprint that commits 24 runs against a 16-run capacity is over-committed in a way the team can see at planning time, not discover at sprint-end. Compare with story points, where over-commitment is invisible until the sprint ends.
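The planning-time check is a single subtraction. A minimal sketch, assuming per-feature budgets are kept in a dictionary; the function name is hypothetical, and the second plan's per-feature split is invented to total the 24 runs mentioned above.

```python
def runs_over_capacity(budgets: dict[str, int], capacity: int) -> int:
    """Return how many runs a sprint plan exceeds capacity by (0 if it fits)."""
    return max(0, sum(budgets.values()) - capacity)


# The sprint goal from the example: 8 + 5 + 3 = 16 runs against 16 capacity.
plan = {"feature X": 8, "feature Y": 5, "feature Z eval set v1": 3}
print(runs_over_capacity(plan, capacity=16))  # 0

# A 24-run commitment against the same capacity is visibly over by 8.
overloaded = {"feature X": 12, "feature Y": 8, "feature Z eval set v1": 4}
print(runs_over_capacity(overloaded, capacity=16))  # 8
```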

Run carryover is meaningful. If feature X used 6 runs and reached 78 percent against an 85 percent target, the carryover is “feature X needs an additional 2 to 4 runs” rather than “feature X is 2 points over.” The conversation about whether to push on or descope is grounded in real numbers.

Run capacity scales with eval-suite parallelism, not with headcount alone. Adding engineers does not add capacity unless the eval suite supports parallel runs. This makes the eval suite’s CI infrastructure visible as the actual constraint on team velocity. See “Why mobile apps cost more than web apps” for the broader pattern of how infrastructure constraints show up as project economics.

Tracking velocity: eval runs per sprint

Velocity in an eval-run regime is runs per sprint. The number is more stable than points-per-sprint because it tracks a real artifact. Three uses.

Sprint capacity planning. A team running at 16 runs per sprint plans the next sprint at 14 to 18 runs. A team that hit 22 runs last sprint and 12 the sprint before knows its eval-suite throughput is volatile and plans against the lower bound.
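A sketch of the "plan against the lower bound when throughput is volatile" rule. The 0.25 volatility tolerance and the function name are assumptions chosen for illustration; a team would tune the cutoff against its own history.

```python
from statistics import mean


def planning_capacity(recent_runs: list[int],
                      volatility_tolerance: float = 0.25) -> int:
    """Pick next sprint's run capacity from recent runs-per-sprint figures.

    Stable history: plan at the average. Volatile history: plan at the minimum.
    """
    if not recent_runs:
        raise ValueError("need at least one past sprint")
    average = mean(recent_runs)
    if average == 0:
        return 0
    # Spread of recent sprints relative to their average.
    spread = (max(recent_runs) - min(recent_runs)) / average
    if spread > volatility_tolerance:
        return min(recent_runs)
    return int(average)


print(planning_capacity([16, 16, 16]))  # 16
print(planning_capacity([12, 22]))      # 12, volatile so plan at the lower bound
```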

Forecasting feature completion. A feature budgeted at 10 runs on a team running 16 runs per sprint will finish in about 0.6 sprints (10 ÷ 16) if it gets the team’s full capacity, not 1. The forecast is honest because the unit is real.
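The forecast is one division. The sketch below adds a `share_of_capacity` parameter, which is an assumption introduced here to cover the common case where a feature gets only part of the sprint's runs.

```python
def sprints_to_finish(remaining_runs: int,
                      runs_per_sprint: float,
                      share_of_capacity: float = 1.0) -> float:
    """Forecast sprints needed, given team velocity and the feature's share of it."""
    if runs_per_sprint <= 0:
        raise ValueError("runs_per_sprint must be positive")
    if not 0 < share_of_capacity <= 1:
        raise ValueError("share_of_capacity must be in (0, 1]")
    return remaining_runs / (runs_per_sprint * share_of_capacity)


print(sprints_to_finish(10, 16))       # 0.625
print(sprints_to_finish(10, 16, 0.5))  # 1.25
```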

Detecting infrastructure regressions. A team whose runs-per-sprint drops from 18 to 12 in three sprints has lost eval-suite throughput, usually because the suite grew and the parallel-run infrastructure did not. The metric surfaces the regression early. Story-point velocity drops would have been blamed on team morale; eval-run velocity drops point at the suite.
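A simple way to surface the regression is to compare the first and last sprint in a rolling window. The three-sprint window follows the example above; the 25 percent alert level, the function names, and the intermediate value of 15 are assumptions for illustration.

```python
def throughput_drop(runs_per_sprint: list[int], window: int = 3) -> float:
    """Fractional drop in runs per sprint across the last `window` sprints."""
    recent = runs_per_sprint[-window:]
    if len(recent) < 2 or recent[0] == 0:
        return 0.0
    return (recent[0] - recent[-1]) / recent[0]


def is_regression(runs_per_sprint: list[int],
                  window: int = 3,
                  alert_at: float = 0.25) -> bool:
    """Flag a likely eval-suite throughput regression (alert level is hypothetical)."""
    return throughput_drop(runs_per_sprint, window) >= alert_at


print(is_regression([18, 15, 12]))  # True, a one-third drop over three sprints
print(is_regression([16, 17, 16]))  # False
```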

The transition from points to runs

One to two sprints for a willing team. Three steps.

Step 1: instrument eval runs. Log each run with a run ID, score, threshold, diff, and feature association. Most teams already have this in Promptfoo, Inspect, OpenAI Evals, or a custom harness; the new discipline is associating runs with features in sprint planning.
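If the existing harness does not already record the feature association, a thin logging layer is enough to start. This is a minimal sketch using a JSON Lines file; the file name, field names, and both function names are assumptions, and it does not use the API of Promptfoo, Inspect, or OpenAI Evals.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def log_run(path: Path, run_id: str, feature: str,
            score: float, threshold: float, diff_ref: str) -> None:
    """Append one eval run as a JSON line (hypothetical schema)."""
    record = {
        "run_id": run_id,
        "feature": feature,
        "score": score,
        "threshold": threshold,
        "diff_ref": diff_ref,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def runs_for_feature(path: Path, feature: str) -> list[dict]:
    """Read back every logged run associated with one feature."""
    with path.open(encoding="utf-8") as f:
        return [r for r in map(json.loads, f) if r["feature"] == feature]


# Demo against a temporary file so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "eval_runs.jsonl"
    log_run(log_path, "run-001", "support-agent", 0.71, 0.85, "abc123")
    log_run(log_path, "run-002", "search-rerank", 0.66, 0.75, "def456")
    print(len(runs_for_feature(log_path, "support-agent")))  # 1
```

The per-feature count read back from this log is the number that goes on the sprint board.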

Step 2: estimate the next sprint in runs. Estimate AI features in eval runs using reference-class forecasting. Keep story points alongside as a reference for the first two sprints. Compare the two at each sprint-end.

Step 3: drop story points for AI work. After two parallel sprints, drop story points for AI features. Keep them for non-AI work where they still make sense.

The hardest part is cultural. A team that has run on story points for three years has scar tissue around them: capacity spreadsheets, retrospective rituals, manager dashboards. The benefit, after the transition, is that AI sprints stop bleeding into the next sprint because the unit was honest from the start.

Common objections

“My PM has been planning in story points for five years.” Then keep story points for non-AI work and use eval runs for AI work. The two units coexist. The PM’s planning skill in points transfers to runs because the underlying skill, forecasting how long work takes, is unit-independent.

“Eval runs don’t capture the implementation effort.” Correct, and not a problem. The implementation effort is fully visible in the work between runs; the runs are the cycles that gate completion. A feature that takes 2 hours of implementation per cycle and 8 hours across 4 cycles is a 4-run feature. The implementation hours are tracked separately if the team wants them; runs are the budget.

“What about AI work that isn’t eval-gated?” Pure-research work without an eval bar runs on time-and-materials at the team level, the same way “The case against fixed-price AI development contracts” describes at the contract level. Eval runs apply to feature work where there is a target threshold, which is most production AI work.

“This feels like more work.” It is less work, after the transition. The volatility of story-pointed AI sprints (the slips, the over-commits, the retrospective postmortems) is invisible work the team is already doing. Eval-run budgeting surfaces that work in the sprint plan, where it can be governed.

Frequently asked questions

Why do story points break for AI work?

Three structural reasons. Effort does not dominate uncertainty for AI work; the cycles required to converge on an eval threshold dominate. The bottleneck is eval-pass cycles, not implementation throughput. Completion is continuous (eval score against threshold), not binary. Story points compress all of this into a number that cannot be defended.

What is an eval run?

One execution of the named eval suite against the current state of the system, producing a score against the threshold. Each run is logged with a run ID, score, timestamp, and diff from the previous run. A typical run takes 30 to 90 minutes wall-clock with 1 to 4 engineering hours of result review.

How do I estimate a feature in eval runs?

Three approaches. Reference-class forecasting against past features the team has shipped. A threshold-difficulty heuristic: 70 percent thresholds typically take 3 to 6 runs, 85 percent 6 to 12, 92 percent 12 to 25. A failure-mode taxonomy, built during the first 2 runs, that sharpens the estimate.

What does sprint planning look like in eval runs?

The sprint goal names eval-threshold targets and run budgets per feature. Total run budget equals team run capacity (typically 12 to 20 runs per sprint). Over-commitment is visible at planning time. Carryover is meaningful: “feature X needs 2 to 4 more runs” rather than “feature X is 2 points over.”

How is velocity tracked?

Runs per sprint. The number is more stable than points-per-sprint because it tracks a real artifact. Used for sprint capacity planning, feature-completion forecasting, and detecting infrastructure regressions when the eval suite’s throughput degrades.

What about AI work that isn’t eval-gated?

Pure-research work without an eval bar runs on time-and-materials at the team level. Eval runs apply to feature work where there is a target threshold, which is most production AI work. The two regimes coexist in a typical sprint.

How long does the transition take?

One to two sprints for a willing team. Step 1: instrument eval runs in the existing framework. Step 2: estimate the next sprint in runs alongside story points. Step 3: drop story points for AI work after two parallel sprints. Non-AI work keeps story points; the unit is matched to the work.

Does this break PM tooling like Jira?

No. Jira accepts custom fields for eval-run estimates the same way it accepts story points. The change is in the unit and the planning ritual, not the tool.

Does this apply to internal AI teams or only agency engagements?

Both. The unit-of-estimation problem is identical inside and outside the agency. Internal teams running scrum on AI work face the same wrong-unit failure as external engagements; the fix is the same.

How does this connect to the AI project economics manifesto?

The manifesto names evaluation as the unit of account at the project level. Eval runs are the unit of account at the sprint level, the operational consequence of manifesto principles in the day-to-day rhythm of engineering work. Without this unit at the sprint level, manifesto principles do not survive contact with the sprint board.

Key takeaways

  • Story points break for AI work on three structural assumptions: effort does not dominate uncertainty, the bottleneck is eval-pass cycles not implementation throughput, and completion is continuous not binary.
  • Eval runs are the right unit because they correspond to real artifacts (run ID, score, timestamp), map to the actual bottleneck (eval-pass cycles), and accumulate as evidence rather than folklore.
  • Estimation in eval runs uses reference-class forecasting, a threshold-difficulty heuristic (70 percent ≈ 3-6 runs, 85 percent ≈ 6-12, 92 percent ≈ 12-25), and a failure-mode taxonomy sharpened after the first 2 runs.
  • Sprint planning in eval runs surfaces over-commitment at planning time, makes carryover meaningful, and exposes the eval suite’s CI infrastructure as the real constraint on team velocity.
  • The transition takes one to two sprints. Story points coexist with eval runs; keep them for non-AI work where they still make sense, drop them for AI work where they never did.

Last Updated: Jun 10, 2026

Arthur Wandzel

SFAI Labs helps companies build AI-powered products that work. We focus on practical solutions, not hype.
