The AI agency project Kanban that actually maps to LLM workflows

The default Linear or Jira Kanban is wrong for AI work, and using it is a quietly expensive choice. Backlog → In Progress → Review → Done is a board built for CRUD, for tickets that are “not started, started, or finished.” AI engineering is not built that way. Every meaningful unit of work passes through a research phase, an eval-design phase, an implementation phase, an eval-passing phase, and a monitoring phase, with a permanent regression queue running alongside. Mapping this onto a four-column board produces a project that looks healthy on the dashboard while quietly accumulating debt in every column. This is the eval-gated Kanban that maps to LLM delivery work, and the case for why the standard board is a misleading instrument.

The frame for this redesign sits inside the AI agency manifesto: the unit of progress in 2026 AI delivery is the eval-gated PR, not the closed ticket. A board that does not surface eval state is a board that hides the only thing that matters. The Kanban described below is the operational implementation of that stance: a board where eval state is a column, not a metadata field.

The argument is not that Linear and Jira are bad tools. They are excellent tools for the work they were designed for. The argument is that the default templates these tools ship with, including their AI-engineering templates as of mid-2025, model a workflow that LLM work does not have, and using them creates a friction tax that compounds across the engagement. The fix is not a new tool; it is a board configuration that respects the actual flow of AI engineering work.

Decision Scope

This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.

Why Backlog → Doing → Done fails for AI work

The standard board has three problems when applied to AI work, each of which would be small in isolation and is large in combination.

Problem 1: it conflates “implementation” with “eval-passing.” A ticket marked “Done” in a CRUD codebase means the feature is shipped. A ticket marked “Done” in an AI codebase means the feature is shipped and the eval suite agrees that it is shipped, and those two are different events, sometimes separated by days. A board that has only one “Done” column either marks tickets done before the eval is green (in which case “Done” is a lie), or holds tickets in “In Progress” until the eval is green (in which case “In Progress” is a lie about how much active engineering work remains). Both lies show up in throughput dashboards as misleading numbers.

Problem 2: it has no visibility into research and eval-design. A senior engineer spending three days on eval-design (writing ground-truth cases, picking the right rubric, defining the threshold) is doing the most important work of the week. On a Backlog → Doing board, that work is either “in progress” indistinguishably from coding, or it is invisible because it does not look like a coded ticket. Either way, the board fails to honor the work that gates everything downstream. The senior reviewer leverage problem described in why most AI agencies underprice senior reviewers is partly a Kanban problem: when the gate work is invisible, it gets under-prioritized and under-staffed.

Problem 3: there is no place for regression work. When a model upgrade or a prompt change causes an eval delta to swing the wrong way, the work to investigate and fix it is not a new ticket; it is a regression on existing work. On a standard board, regressions either reopen old tickets (which corrupts the throughput metric) or get filed as new tickets (which corrupts the dependency graph). Neither approach is honest. AI work has a permanent regression queue that needs its own surface, not a hack on the existing flow.

Together, these three problems produce a board that misleads the team, the agency, and the client about where the work is and what is at risk. The fix is a different board.

The eval-gated Kanban

The board has seven columns and one parallel queue. The seven columns are sequential and represent the actual flow of AI engineering work; the parallel queue is permanent and runs alongside the sequential flow.

Column 1: Backlog. Tickets that are scoped enough to work on but not yet started. There is no WIP limit (it is a queue, not a workspace).

Column 2: Researching. The ticket has been picked up and the engineer is doing the research that has to happen before either eval-design or implementation: reading the prior art, characterizing the data, sketching the approach. The output of this column is a short written research note attached to the ticket, usually 200–500 words covering “what we know, what we don’t, what we’ll try first.” WIP limit per engineer is 1.

Column 3: Eval-design. The ticket has clear enough framing that the team can define what “passing” looks like. This column produces ground-truth eval cases (typically 5–15 for a feature ticket), the pass/fail criterion, and the threshold the implementation must clear. The output is committed to the repo as a diff to evals/ or as a new file. WIP limit per engineer is 1; this is the gating column and it deserves protection.
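
As a rough sketch of what the Eval-design output could look like once committed under evals/, the structure below bundles the ground-truth cases, the pass/fail criterion, and the threshold for one ticket. The names EvalCase and EvalSpec, the exact-match criterion, and the ticket ID format are illustrative assumptions, not a prescribed format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class EvalCase:
    """One ground-truth case: an input and the output we expect."""
    prompt: str
    expected: str


@dataclass
class EvalSpec:
    """Hypothetical Eval-design artifact for a single ticket."""
    ticket_id: str
    threshold: float  # minimum pass rate the implementation must clear
    cases: list[EvalCase] = field(default_factory=list)

    def pass_rate(self, outputs: list[str]) -> float:
        """Fraction of cases whose output exactly matches the expectation.

        Exact match is an assumed criterion; a real rubric may be looser.
        """
        if len(outputs) != len(self.cases):
            raise ValueError("one output per case is required")
        if not self.cases:
            return 0.0
        hits = sum(o.strip() == c.expected for o, c in zip(outputs, self.cases))
        return hits / len(self.cases)

    def clears_gate(self, outputs: list[str]) -> bool:
        """True when the pass rate meets or exceeds the threshold."""
        return self.pass_rate(outputs) >= self.threshold
```

With five cases and a threshold of 0.8, an implementation that gets four cases right clears the gate and one that gets three right does not. Writing the spec before the code is what lets the Implementing column work against a fixed target.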

Column 4: Implementing. The traditional “in progress” column, but with the explicit understanding that the eval cases already exist. The engineer is writing code against an eval suite, not in a vacuum. The output is a PR opened against main with the eval delta in the description. WIP limit per engineer is 2; in practice that means one ticket in active implementation while another awaits code review.

Column 5: Eval-passing. The PR is open and the eval gate is running. If the eval delta is below the threshold, the ticket sits here while the engineer iterates on the implementation. If the delta meets or exceeds the threshold, the PR is approved and the ticket moves to Shipping. WIP limit per engineer is 2; this column tends to have high throughput because PRs sit here for hours, not days.

Column 6: Shipping. The PR is merged and is being deployed. This includes staging deploy, canary, and full production rollout if applicable. The output is the deployed version with monitoring active. WIP limit per engineer is 2. This column should be short; if tickets pile up here, the deploy pipeline is the bottleneck, which is a process problem worth addressing immediately.

Column 7: Monitoring. The ticket has shipped and is being monitored against the production eval slice for at least 7 days (or whatever the engagement-defined burn-in period is). If production eval deltas remain above the threshold across the burn-in, the ticket moves to Done. If they regress, the ticket moves to the regression queue. There is no WIP limit; this is a queue of “in observation” work.
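
The Monitoring exit rule can be sketched as a small function. This is a minimal illustration under stated assumptions: the function name, the one-score-per-day input, and the choice to treat a single day below the threshold as a regression are all hypothetical; an engagement may well define regression over a rolling window instead.

```python
from __future__ import annotations

BURN_IN_DAYS = 7  # default from the text; engagements may define their own


def monitoring_outcome(daily_scores: list[float], threshold: float,
                       burn_in_days: int = BURN_IN_DAYS) -> str:
    """Return the column a monitored ticket belongs in.

    daily_scores holds one production eval score per day since shipping.
    Assumption: any single day below the threshold counts as a regression.
    """
    if any(score < threshold for score in daily_scores):
        return "Regression queue"
    if len(daily_scores) >= burn_in_days:
        return "Done"
    # Still inside the burn-in window with no regression so far.
    return "Monitoring"
```

A ticket with seven clean days moves to Done, a ticket with three clean days stays in Monitoring, and a ticket with any day below the threshold goes to the regression queue.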

Column 8 (parallel): regression queue. Tickets that were closed and have since regressed against the eval suite. Sources include monthly model-upgrade testing, production-traffic eval drift, and prompt revisions that broke previously passing cases. The regression queue has its own WIP limit per engineer (1) and its own priority; regressions are addressed before new feature work, with explicit override required if the team chooses otherwise. This is the column that does not exist on a standard board, and its absence is responsible for most of the silent debt in AI engagements.
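
One way to make the per-engineer limits checkable is to encode them as data and ask, before a pull, whether the target column has room. The sketch below uses the limits stated in the column descriptions; the dictionary name and the can_pull helper are hypothetical and not part of Linear or Jira.

```python
from __future__ import annotations

from collections import Counter
from typing import Optional

# Per-engineer WIP limits from the column descriptions; None means unlimited.
WIP_LIMITS: dict[str, Optional[int]] = {
    "Backlog": None,
    "Researching": 1,
    "Eval-design": 1,
    "Implementing": 2,
    "Eval-passing": 2,
    "Shipping": 2,
    "Monitoring": None,
    "Regression queue": 1,
}


def can_pull(engineer_columns: list[str], target: str) -> bool:
    """True if the engineer may take one more ticket into `target`.

    engineer_columns lists the current column of each ticket the
    engineer already owns.
    """
    limit = WIP_LIMITS[target]
    if limit is None:
        return True
    return Counter(engineer_columns)[target] < limit
```

An engineer who already holds one Eval-design ticket cannot pull a second, while the same engineer can hold any number of tickets in Monitoring.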

WIP limits and what they enforce

WIP limits are not bureaucratic decorations; they are the part of the system that keeps the board honest. The recommended limits per engineer above add up to a discipline: at most 9 tickets in motion (1 + 1 + 2 + 2 + 2 across Researching through Shipping, plus 1 in the regression queue), which is plenty to keep the engineer busy while preventing context-switching that degrades both code and eval quality.

The most important WIP limits are on Eval-design and Implementing. An engineer with three eval-designs in flight is, in practice, designing none of them well; the eval-design column is taste-driven work that does not parallelize. An engineer with five tickets implementing is producing code that no one, the engineer included, will ever review carefully, which is the failure mode that makes mid-level engineers look unproductive even when they are working hard. The limits force the work to be sequential at the level where sequencing matters and parallel where parallelism is cheap (Backlog and Monitoring).

The team-level WIP limit on the regression queue is also important. A regression queue with more than two open items per agent in production is a signal that the team is shipping faster than they are stabilizing; the eval suite is detecting the failure but the team is not addressing it, and the gap will eventually become a customer incident. A standing rule of “no more than two open regressions per production agent before new feature work is paused” keeps the discipline honest.
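
The standing rule can be expressed as a check over the open regression queue. This is a sketch only: the function names and the agent names used in the example are hypothetical, and the input is assumed to be one agent name per open regression ticket.

```python
from __future__ import annotations

from collections import Counter

MAX_OPEN_REGRESSIONS_PER_AGENT = 2  # the standing rule described above


def agents_over_limit(open_regressions: list[str],
                      limit: int = MAX_OPEN_REGRESSIONS_PER_AGENT) -> list[str]:
    """Return production agents with more open regressions than the limit.

    open_regressions holds the agent name for each open regression ticket.
    """
    counts = Counter(open_regressions)
    return sorted(agent for agent, n in counts.items() if n > limit)


def feature_work_paused(open_regressions: list[str]) -> bool:
    """New feature work pauses as soon as any agent exceeds the limit."""
    return bool(agents_over_limit(open_regressions))
```

For example, three open regressions on a hypothetical "support-bot" agent and two on "triage" would flag only "support-bot" and pause new feature work until its queue drops back to two.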

Swimlanes by eval domain

The columns described above are the flow. Swimlanes are the orthogonal axis: each swimlane represents an eval domain, and tickets flow horizontally within their lane.

A typical engagement has three to five eval domains, depending on what the agent is doing:

Accuracy: does the agent produce the correct output?
Cost: is the agent within the cost-per-call ceiling?
Latency: is the agent within the P95 latency budget?
Safety: does the agent refuse out-of-scope requests, redact PII, resist prompt injection?
Cost-of-incidents: for agents with human-in-loop, what fraction of outputs require human override?

Swimlanes matter because different eval domains have different gating columns. A ticket in the Cost lane spends most of its time in Eval-design and Implementing; a ticket in the Safety lane spends most of its time in Eval-design and Monitoring (because safety failures often only surface under production traffic). A board without swimlanes treats all evals as equivalent, which is a misread of the actual work.

The swimlane structure also makes resource allocation legible. If the Accuracy swimlane has 12 tickets and the Safety swimlane has 0, the team is over-investing in the visible work and under-investing in the invisible work, a pattern that produces agents that pass demos and fail audits. The swimlane balance is a leading indicator that the board surfaces and a Backlog → Doing → Done board hides.
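
Swimlane balance is easy to compute once each ticket carries its eval domain. The sketch below counts tickets per lane and lists lanes with none; the helper names are hypothetical, and the domain list is the five-domain example from above, which an engagement with fewer domains would trim.

```python
from __future__ import annotations

from collections import Counter

# The five example eval domains; an engagement may use only three or four.
EVAL_DOMAINS = ["Accuracy", "Cost", "Latency", "Safety", "Cost-of-incidents"]


def lane_counts(ticket_lanes: list[str],
                domains: list[str] = EVAL_DOMAINS) -> dict[str, int]:
    """Count tickets per swimlane, including lanes with zero tickets."""
    counts = Counter(ticket_lanes)
    return {domain: counts.get(domain, 0) for domain in domains}


def empty_lanes(ticket_lanes: list[str],
                domains: list[str] = EVAL_DOMAINS) -> list[str]:
    """Lanes with no tickets at all: the under-invested domains."""
    return [d for d, n in lane_counts(ticket_lanes, domains).items() if n == 0]
```

A board with 12 Accuracy tickets, 2 Cost tickets, and 1 Latency ticket would report Safety and Cost-of-incidents as empty lanes, which is the imbalance the text warns about.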

Why Linear and Jira default boards mislead AI delivery teams

The point of this piece is not that Linear or Jira are inadequate. The point is that the default templates these tools ship with, even the AI-flavored ones launched in 2024 and 2025, assume a CRUD workflow with optional metadata fields for “model” or “prompt version.” That metadata is useful but does not solve the structural problem: the board’s columns model the wrong workflow.

The misleading effect is most acute on three audiences. Project managers who pull metrics from the default board over-report throughput because tickets marked Done include eval-failed work that has been informally rolled back. Clients who look at the board over-trust progress because Monitoring and the regression queue are invisible. Senior engineers who look at the board under-prioritize the eval-design and regression work because it is structurally invisible compared to “tickets I have moved.” All three effects compound: the metric lies, the client trusts the lie, and the team optimizes for the metric rather than the work.

Reconfiguring the columns and swimlanes does not require new tools. It requires that the agency tech lead explicitly designs the board for AI work on day 1 of the engagement and trains the team to use it. The reconfiguration is part of the AI agency operating system: the standardized board configuration that ships with every engagement and makes throughput legible the same way across clients.

The honest critique of this board

The board described here has costs. It is more complex than the default. New engineers take longer to learn it. PMs who are accustomed to Backlog → Doing → Done need a 30-minute onboarding to understand why “Eval-passing” is a separate column from “Implementing.” The regression queue feels foreign until the first regression hits, after which it feels obvious.

The cost is worth paying because the alternative, using a board that does not match the work, produces work that does not match the board. AI delivery teams that adopt the eval-gated Kanban tend to converge on the discipline within two engagements; teams that resist the change tend to end up reinventing the columns through ad-hoc tags and metadata fields anyway, with worse legibility and no shared vocabulary. The board is the cheapest part of the operating discipline. Skipping it is a false economy.

The honest takeaway is that AI engineering work has a different shape than the work the default Kanban tools were designed for, and the agencies that respect that difference move faster, see their work more clearly, and produce throughput numbers that are not lies. The board is a small thing. It is also one of the higher-leverage small things to fix, because the data on the board becomes the data the agency uses to plan the next engagement, price the current one, and justify the senior premium that the eval-design column makes visible. A board that hides the gate work hides the case for paying seniors what they are worth.

Arthur Wandzel is the founder of SFAI Labs. The eval-gated Kanban described here is the standard board configuration shipped on every SFAI engagement and is the basis for the agency’s throughput, capacity-planning, and senior-allocation models.

Frequently Asked Questions

Why doesn’t the standard Backlog → Doing → Done Kanban work for AI engineering?

Three reasons. First, it conflates implementation with eval-passing: a CRUD ticket marked Done means the feature shipped, but an AI ticket marked Done means both that it shipped and that the eval suite agrees, which are different events sometimes separated by days. Second, it has no visibility into research and eval-design work, which is the gating work that determines downstream success. Third, there is no place for regression work, so model upgrades and prompt changes that swing eval deltas have nowhere to live without corrupting the throughput metric. The standard board produces a project that looks healthy on the dashboard while accumulating debt in every column.

What columns should an AI agency Kanban board have instead?

Seven sequential columns plus one parallel queue. Sequential: Backlog, Researching, Eval-design, Implementing, Eval-passing, Shipping, Monitoring. Parallel: regression queue. The seven columns model the actual flow of LLM engineering work: a senior researches, designs evals, implements against those evals, waits for the eval gate, ships, and monitors before declaring the work done. The regression queue runs alongside and holds tickets that were closed and have since regressed against the eval suite due to model upgrades, traffic drift, or prompt changes.

What WIP limits should an AI Kanban board enforce?

Per engineer: 1 in Researching, 1 in Eval-design, 2 in Implementing, 2 in Eval-passing, 2 in Shipping, unlimited in Monitoring, 1 in the regression queue. Backlog and Monitoring are unlimited because they are queues. The most important limits are on Eval-design and Implementing because eval-design is taste-driven work that does not parallelize and Implementing controls how much code review the engineer can give each PR. Team-wide: no more than two open regressions per production agent before new feature work is paused.

What are eval-domain swimlanes and why do they matter?

Swimlanes are the orthogonal axis to the columns; each swimlane represents an eval domain (Accuracy, Cost, Latency, Safety, Cost-of-incidents) and tickets flow horizontally within their lane. They matter because different eval domains have different gating columns; a ticket in the Cost lane spends most of its time in Eval-design and Implementing, while a ticket in the Safety lane spends most of its time in Eval-design and Monitoring. The swimlane structure also makes resource allocation legible: if Accuracy has 12 tickets and Safety has 0, the team is over-investing in visible work.

Why is the regression queue a separate column instead of new tickets?

Because regressions are not new work; they are existing work that has degraded. Filing them as new tickets corrupts the dependency graph and the throughput metric, since the original ticket is still ‘done’ even though the work has unwound. Reopening old tickets also corrupts the throughput metric. A separate regression queue treats regression work as first-class, surfaces the rate of regression as a leading indicator of agent reliability, and gives the team a place to prioritize stabilization against new feature work. Without it, regression work lives in shadow systems and gets under-prioritized.

Should AI agencies use Linear, Jira, or build their own tool for this?

Use Linear or Jira and reconfigure the columns. Both tools support custom columns and swimlanes; the issue is that their default templates ship with a CRUD workflow. The reconfiguration takes a tech lead about an hour on day 1 of the engagement and produces a board that respects AI engineering’s actual flow. Building a new tool is overkill; the tools are excellent for the work they were designed for; only the default templates are wrong. The fix is configuration, not migration.

How long does it take a team to adapt to the eval-gated Kanban?

About two engagements. New engineers take longer to learn the board than the default Backlog → Doing → Done, and PMs accustomed to the legacy template need a 30-minute onboarding to understand why Eval-passing is a separate column from Implementing. The regression queue feels foreign until the first regression hits, after which it feels obvious. Teams that adopt the board converge on the discipline within two engagements; teams that resist tend to reinvent the columns through ad-hoc tags and metadata fields anyway, with worse legibility.

How does the board change client-facing reporting?

It makes the eval gate visible to the client and turns throughput into a number that does not lie. Instead of ‘we shipped 12 tickets this week’ (which conflates eval-failed and eval-passing work), the report becomes ‘we shipped 8 tickets that cleared the eval gate, 3 are in eval-passing iteration, 1 is in the regression queue from last month’s model upgrade.’ The client sees the work, sees what gates it, and sees the regression rate as a leading indicator. This conversation is impossible on a Backlog → Doing → Done board because the numbers cannot be trusted.
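
A report line of that shape can be generated directly from column counts. The sketch below is illustrative: the function name is hypothetical, the input is assumed to be the current column of each ticket in the reporting window, and "cleared the eval gate" is assumed to mean the ticket sits in Shipping, Monitoring, or Done.

```python
from __future__ import annotations

from collections import Counter

# Assumption: a ticket past Eval-passing has cleared the gate.
PAST_GATE = ("Shipping", "Monitoring", "Done")


def weekly_report(ticket_columns: list[str]) -> str:
    """Summarize a reporting window by eval-gate state.

    ticket_columns holds the current column of each ticket in the window.
    """
    counts = Counter(ticket_columns)
    cleared = sum(counts[column] for column in PAST_GATE)
    return (
        f"{cleared} cleared the eval gate, "
        f"{counts['Eval-passing']} in eval-passing iteration, "
        f"{counts['Regression queue']} in the regression queue"
    )
```

The point of the design is that the headline number only counts tickets past the gate, so eval-failed work cannot inflate it.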

Does the eval-gated Kanban work for non-AI work mixed into the same engagement?

Yes, with a small caveat. Non-AI tickets (infrastructure changes, plumbing work, documentation) flow through the same columns but skip the Eval-design and Eval-passing columns by convention. They move from Researching to Implementing to Shipping to Monitoring. The same WIP limits and swimlane discipline apply. The board does not require every ticket to be an AI ticket; it requires that AI tickets are not collapsed into a workflow that hides their structure. Mixed engagements use the full board with non-AI tickets short-circuiting the eval columns.
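
The short-circuit convention can be sketched as a routing function over the seven flow columns. The names route and next_column are hypothetical, and the sketch assumes the happy path where a ticket leaving Monitoring goes to Done rather than to the regression queue.

```python
from __future__ import annotations

FLOW = [
    "Backlog", "Researching", "Eval-design", "Implementing",
    "Eval-passing", "Shipping", "Monitoring",
]
EVAL_COLUMNS = {"Eval-design", "Eval-passing"}


def route(is_ai_ticket: bool) -> list[str]:
    """Columns a ticket passes through; non-AI tickets skip the eval columns."""
    if is_ai_ticket:
        return list(FLOW)
    return [column for column in FLOW if column not in EVAL_COLUMNS]


def next_column(current: str, is_ai_ticket: bool) -> str:
    """Next column on the ticket's route.

    Assumption: a ticket leaving Monitoring without a regression is Done.
    """
    path = route(is_ai_ticket)
    index = path.index(current)
    return path[index + 1] if index + 1 < len(path) else "Done"
```

A non-AI ticket in Researching moves straight to Implementing, while an AI ticket in the same column moves to Eval-design first.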

What is the single biggest impact of switching to this board?

Visibility into senior reviewer leverage. The Eval-design column makes the work that gates everything else explicitly visible, which forces the agency to staff and price it correctly. On a default board, eval-design is invisible (buried inside ‘In Progress’ or done in Slack threads), and the senior reviewer’s leverage is consequently invisible to the client and to the agency’s own throughput model. The eval-gated Kanban is the cheapest place to make that leverage measurable, which has implications for staffing, pricing, and the case for the senior premium that follows.
