Most companies firing an AI agency in 2026 should have done it three months earlier. The signal is almost always there by week four (the absence of an eval suite, the account-manager-mediated technical conversation, the PR that never quite ships), and the cost of waiting is the cost of two more invoices plus the institutional embarrassment of the sunk-cost defense. The reverse is also true: most companies clinging to a brilliant agency for the wrong reasons (the deck was great, the senior partner is impressive, “they came recommended”) would be better served by triaging the engagement against twelve specific markers and acting on the result.
This piece is the triage. Six markers identify the AI agency you should fire, not because they are wicked, but because they are running a 2024 service model that cannot deliver in 2026. Six markers identify the agency you should keep: the forward-deployed AI dev partner described in the manifesto, with the discipline made visible. Each marker is observable in your current engagement within a working week. None of them requires an off-site, a deck, or a rationalization session. At the end is a worksheet you can run against any AI agency in 30 minutes.
Decision Scope
This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.
Table of contents
- Why the triage matters now
- Marker 1 to fire: no eval suite in your repo
- Marker 2 to fire: account-manager-mediated technical comms
- Marker 3 to fire: deck-first delivery
- Marker 4 to fire: IP retention disputes
- Marker 5 to fire: missing or sanitized post-mortems
- Marker 6 to fire: junior-led delivery, senior-fronted sales
- Marker 1 to keep: eval-gated PRs as the unit of work
- Marker 2 to keep: a named senior contributor on your repo
- Marker 3 to keep: weekly Loom demos against real data
- Marker 4 to keep: the prompt registry lives in your repo
- Marker 5 to keep: post-mortems published, including the ugly ones
- Marker 6 to keep: IP transferred at every milestone
- The 30-minute buyer-side worksheet
Why the triage matters now
The market has bifurcated. Roughly a third of AI agencies operating in 2026 still run a deck-driven, workshop-heavy delivery cycle borrowed from 2018-era management consulting. They sell strategy, ship narrative, and bill on time-and-materials with a healthy markup on inference. Another third have rebuilt their delivery around eval-driven engineering, named senior contributors, and observable artifacts. The remaining third are mid-transition and moving fast.
The triage matters because every quarter you keep the wrong agency is a quarter your eval baseline does not advance, your prompt registry does not get versioned in your repo, and your inference bill grows untracked. The cost is not just the invoice; it is the option value of working with a firm that compounds your AI capability week over week. Six months with the wrong agency is twenty-four wasted weekly demos and an eval suite that does not exist. The worksheet at the end of this piece is designed to surface that cost in 30 minutes.
Marker 1 to fire: no eval suite in your repo
Open your repo. Search for evals/, eval/, evaluations/, or any directory that contains ground-truth examples paired with assertions. If you cannot find one, and the agency has been engaged for more than four weeks, you are paying for opinion-based code review. Every PR conversation is a debate about whether the model “feels” better, because there is no number to point to. The agency may have its own internal eval scaffolding, but if it does not live in your repo, you cannot run it, version it, or add cases to it when the next failure mode surfaces in production.
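The directory search described above can be scripted. The following is a minimal sketch, assuming a local checkout of the repo; the function name `find_eval_dirs` and the list of skipped directories are illustrative choices, not part of any standard tool. It matches on directory names only, so a hit still needs a manual look to confirm it holds ground-truth examples paired with assertions.

```python
from pathlib import Path

# Directory names the text above suggests searching for.
EVAL_DIR_NAMES = {"evals", "eval", "evaluations"}
# Directories skipped during the walk (an assumption; adjust for your repo).
SKIP_DIRS = {".git", "node_modules", ".venv", "__pycache__"}


def find_eval_dirs(repo_root: str) -> list[str]:
    """Return repo-relative paths of directories whose name suggests an eval suite."""
    root = Path(repo_root)
    matches = []
    for path in root.rglob("*"):
        if not path.is_dir():
            continue
        rel = path.relative_to(root)
        # Ignore anything that lives under a vendored or tooling directory.
        if any(part in SKIP_DIRS for part in rel.parts):
            continue
        if path.name.lower() in EVAL_DIR_NAMES:
            matches.append(rel.as_posix())
    return sorted(matches)


if __name__ == "__main__":
    found = find_eval_dirs(".")
    print(found if found else "No eval directory found by name.")
```

An empty result is a prompt to ask the agency where the evals live, not proof on its own that none exist.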
The defense most agencies offer for the missing eval suite is “we are still characterizing the problem.” This was a defensible answer in week two; it is a fireable answer in week six. The corollary is that an agency without an eval suite cannot quote a meaningful eval delta on any PR description, which means you have no way to distinguish progress from motion.
Marker 2 to fire: account-manager-mediated technical comms
If the engineer building the system cannot Slack you directly, and every technical question is filtered through a project manager, an engagement director, or a “delivery lead,” the agency has structured the relationship to extract margin from communication friction. Every clarifying question becomes a 24-hour round trip. Every architectural trade-off becomes a meeting that produces follow-up notes that produce follow-up emails. The cost is not the meetings; it is the slowness, which compounds into shipped features that arrive a month later than they should.
Forward-deployed agencies do the opposite. The senior engineer joins your Slack on day one, attends your standup, and pings you when the eval delta on the PR is below threshold. There is a project manager on the agency side, but they handle scheduling and invoicing, not the technical conversation. If your agency’s contract specifically routes communication through an account team, the contract was designed to slow the engagement down.
Marker 3 to fire: deck-first delivery
Count the decks the agency has produced versus the PRs they have merged into your repo. If the deck count is higher after week two, the agency is selling a 2018 service model. Decks have a place (kickoff context, board reporting, executive summaries), but they should be a downstream artifact of work, not the primary deliverable. An agency that produces a 40-slide architecture deck before the first PR has confused itself about what the engagement is.
The signal is not the existence of decks; it is their position in the workflow. A healthy agency writes a one-page architecture decision record in the repo on day six and produces a stakeholder summary deck on day 13 to support the demo. A failing agency writes the deck first, runs a workshop to socialize the deck, and then proposes a “follow-up sprint” to begin implementation. By the time implementation begins, the deck is stale and the budget is spent.
Marker 4 to fire: IP retention disputes
If you have asked who owns the prompts, the eval cases, the synthetic training data, or the fine-tuned model weights, and the agency has answered with anything other than “you do, transferred at every milestone,” the engagement is structured to lock you in. The 2024 default was for agencies to retain prompt templates as “methodology” and weights as “trade secret.” That posture is no longer defensible in 2026 because the courts have started treating prompts and weights as work product, and the better agencies have adapted.
If your master services agreement contains language like “Agency retains all rights to underlying tools, methodologies, and frameworks used in the development process,” and the agency cannot identify in writing what is excluded from work-product transfer, you are paying for an asset they will not deliver. The remedy is in the AI agency contract negotiation guide, but the simpler remedy is to fire the agency before the lock-in tightens.
Marker 5 to fire: missing or sanitized post-mortems
Every AI engagement has at least one regression: a model update that broke a downstream tool, a retrieval drift, a cost spike, a hallucination that reached a user. The question is whether the agency writes it up. A post-mortem in your repo at docs/postmortems/2026-04-15-retrieval-drift.md with a timeline, a root cause, a remediation, and an eval case added to prevent recurrence is the artifact of an agency that intends to learn. A verbal “we caught it and fixed it, all good” is the artifact of an agency that intends to invoice.
The harder version of this marker: ask the agency for the last three post-mortems they wrote on any client engagement (anonymized is fine). If they cannot produce three, they have not been writing them. The post-mortem culture is a ratchet; once you start writing them, you do not stop, because the cost of not writing them shows up in the next regression. Agencies that have never started are not going to start on your account.
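A rough completeness check for a post-mortem file can be sketched as follows. It assumes the four elements named above (timeline, root cause, remediation, eval case) appear as literal phrases somewhere in the markdown, and that filenames follow the dated pattern of the example path. Both conventions, and the helper names, are illustrative assumptions rather than an established standard.

```python
import re

# The four elements the text above expects in a post-mortem.
REQUIRED_SECTIONS = ("timeline", "root cause", "remediation", "eval case")
# Dated filename pattern modeled on 2026-04-15-retrieval-drift.md (an assumption).
FILENAME_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}-[a-z0-9-]+\.md$")


def missing_sections(markdown_text: str) -> list[str]:
    """Return the required elements that are not mentioned anywhere in the text."""
    lowered = markdown_text.lower()
    return [section for section in REQUIRED_SECTIONS if section not in lowered]


def is_dated_filename(filename: str) -> bool:
    """True when the filename looks like YYYY-MM-DD-short-slug.md."""
    return FILENAME_PATTERN.match(filename) is not None
```

A keyword match says nothing about whether the root cause is frank or the remediation real; it only flags files that are obviously incomplete.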
Marker 6 to fire: junior-led delivery, senior-fronted sales
The bait-and-switch is old. The senior partner attends the pitch. The senior partner attends the kickoff. By week three, the day-to-day work is being done by an associate two years out of school, with the senior partner attending a 30-minute weekly check-in. The PRs reflect it. The architecture decisions reflect it. The eval discipline, if it exists, is shallow because the person doing the work has not yet developed the instinct for which failure modes matter.
This is not an argument against junior engineers; junior engineers should be on every team and learning. It is an argument against the staffing pattern where seniors are sales tax and juniors are delivery cost. The healthy pattern is a named senior contributor at meaningful allocation (50 percent or more) with juniors supporting. If your engagement has flipped (junior at 100 percent, senior at 5 percent in an “advisory” capacity), you are paying for a senior and getting a junior.
Marker 1 to keep: eval-gated PRs as the unit of work
Every PR description includes a line like baseline 0.61, this PR 0.74, threshold 0.80, gap 0.06. The CI runs the eval suite on every push. PRs that fail the eval gate do not merge until the gap closes. The engineer opens the PR, the eval delta is in the title, and the conversation in the review is structured around the number rather than the opinion. This is the single most diagnostic marker of a healthy engagement, and it is observable in 30 seconds by opening any of the last five PRs.
The corollary is that the eval gate is treated as a feature, not a chore. New failure modes generate new eval cases. The threshold tightens over the life of the engagement. By month three, the eval suite is part of the institutional memory of the project, and any future agency or in-house team can pick up the work because the discipline is embedded in the repo. Agencies that work this way produce systems that survive their departure.
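The gate logic can be sketched in a few lines. This example assumes the PR description carries a line in the illustrative format used in this section ("baseline 0.61, this PR 0.74, threshold 0.80, ...") and that a higher score is better; the function name `check_eval_gate` and the returned keys are hypothetical, not taken from any particular CI system.

```python
import re

# Matches the illustrative line format used in this section:
# "baseline 0.61, this PR 0.74, threshold 0.80, gap 0.06"
EVAL_LINE = re.compile(
    r"baseline\s+(?P<baseline>\d*\.?\d+),\s*"
    r"this PR\s+(?P<current>\d*\.?\d+),\s*"
    r"threshold\s+(?P<threshold>\d*\.?\d+)",
    re.IGNORECASE,
)


def check_eval_gate(pr_description: str) -> dict:
    """Parse the eval line and report the delta, the remaining gap, and pass/fail."""
    match = EVAL_LINE.search(pr_description)
    if match is None:
        raise ValueError("no eval line found in PR description")
    baseline = float(match["baseline"])
    current = float(match["current"])
    threshold = float(match["threshold"])
    return {
        # Improvement of this PR over the baseline.
        "delta": round(current - baseline, 4),
        # Distance still to cover before the gate opens (zero once it passes).
        "gap": round(max(threshold - current, 0.0), 4),
        # Assumes higher is better; invert for error-rate style metrics.
        "passes": current >= threshold,
    }
```

For the example line, this reports a delta of 0.13 over baseline, a remaining gap of 0.06, and a failing gate, which matches the gap quoted in the PR description.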
Marker 2 to keep: a named senior contributor on your repo
A specific human, with a real GitHub username, who is committing meaningful code to your repo at meaningful frequency. Not the agency’s CEO. Not the engagement director. The actual senior engineer doing the work. You should be able to name them, message them on Slack, and see their commits without filtering. They attend the architecture review, they own the eval suite, and they show up when the system has a regression at 11 PM.
The pattern matters because senior judgment is the rarest resource in AI engineering, and the failure modes that destroy AI projects in production are the ones that require senior judgment to anticipate. A junior can implement; a senior knows when not to. The forward-deployed pattern is to put a senior on the engagement at meaningful allocation, with juniors supporting under their review. The field guide to evaluating an AI agency covers how to verify this in the pitch, but the easier verification is to look at the GitHub commit graph after week two.
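The commit-graph check can be approximated from a list of commit authors, for example the output of `git log --format=%an` split into lines. The sketch below assumes raw commit counts are an acceptable proxy; the 30 percent default mirrors the worksheet row later in this piece, and the function names are illustrative.

```python
from collections import Counter


def commit_share(authors: list[str]) -> dict[str, float]:
    """Map each author to their fraction of the commits in the list."""
    counts = Counter(authors)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {author: count / total for author, count in counts.items()}


def senior_meets_bar(authors: list[str], senior: str, minimum_share: float = 0.30) -> bool:
    """True when the named senior authored at least minimum_share of the commits."""
    return commit_share(authors).get(senior, 0.0) >= minimum_share
```

Raw counts cannot tell a meaningful commit from a typo fix, so treat the number as a starting point for reading the actual diffs.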
Marker 3 to keep: weekly Loom demos against real data
Every Friday, a five-to-eight-minute Loom recording of the system running against real data, with the eval dashboard visible. The senior engineer narrates what was shipped, what the eval delta was, and what is queued for the following week. The Loom is committed to the repo as a link or to a shared drive that the entire client team can access. It is not a polished demo; it is a working artifact of progress.
The Loom matters because it forces the agency to have something working every Friday. Demos against synthetic data do not count, because synthetic data hides exactly the failure modes that real data surfaces. Demos in person, without a recording, do not count either, because they are not durable artifacts and the next stakeholder who joins the project cannot watch them. A six-month-old engagement should have 20-plus Looms in the repo, and a stakeholder catching up should be able to watch the last three and understand the trajectory.
Marker 4 to keep: the prompt registry lives in your repo
A directory at prompts/ or src/prompts/ with every production prompt, versioned, named, and committed. Each prompt has a changelog entry when it changes, and each change is paired with the eval delta that justified the change. The prompt registry is the institutional memory of the system; an agency that owns it owns the system, and an agency that puts it in your repo is signaling that the asset is yours.
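One way to picture a registry entry is as a small record that pairs each prompt change with its changelog note and eval scores. The sketch below is a hypothetical data model, not a format the article prescribes; the class `PromptChange`, its fields, and the helper `undocumented_changes` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PromptChange:
    """One versioned change to a named production prompt (hypothetical schema)."""
    prompt_name: str
    version: int
    changelog: str
    eval_before: Optional[float]
    eval_after: Optional[float]

    @property
    def eval_delta(self) -> Optional[float]:
        """Eval improvement that justified the change, or None if not recorded."""
        if self.eval_before is None or self.eval_after is None:
            return None
        return round(self.eval_after - self.eval_before, 4)


def undocumented_changes(changes: list[PromptChange]) -> list[PromptChange]:
    """Return changes that lack a changelog note or a recorded eval delta."""
    return [c for c in changes if not c.changelog.strip() or c.eval_delta is None]
```

Running `undocumented_changes` over the registry history gives a quick list of prompt edits that shipped without evidence attached.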
This marker is the corollary of Marker 4 to fire (IP retention disputes). The healthy agency does not need to be asked. The prompt registry shows up in your repo by week two because it is the obvious place for it, and the eval discipline is impossible without it. If you ever migrate to in-house, the prompt registry is the artifact you take with you, and the value of the engagement is preserved.
Marker 5 to keep: post-mortems published, including the ugly ones
The agency writes post-mortems for every regression and publishes them in your repo. The ugly ones (the cost spike caused by a forgotten retry loop, the hallucination that reached a user, the eval threshold that turned out to be wrong for the use case) are written up with the same rigor as the routine ones. Each post-mortem produces an eval case, a monitoring alert, or a runbook entry that prevents recurrence. The cumulative effect is a system that gets harder to break over time.
Post-mortem culture is the marker most predictive of long-term engagement health, because it is the marker that compounds. An agency that writes ten post-mortems in six months has an institution that has learned ten failure modes. An agency that writes none has an institution that is going to discover the first failure mode in production, in front of your users, with no playbook for the response.
Marker 6 to keep: IP transferred at every milestone
The MSA names every category of work product (prompts, eval cases, synthetic data, fine-tuned weights, infrastructure code, observability config, documentation) and assigns ownership to the client at every milestone. There is no end-of-engagement transfer event because the transfer is continuous. Each milestone payment releases a tagged artifact set into the client’s repo and accounting system. The agency retains a license to use anonymized aggregates for case studies and benchmarks (with client consent), but the assets themselves are the client’s from the moment they are produced.
This is the inverse of Marker 4 to fire. The agency that transfers IP continuously is signaling that the engagement is structured around shared upside rather than vendor lock-in. The contractual mechanics are detailed in the AI due diligence before contract guide, but the practical test is to look at the last milestone document and verify that the work-product transfer language is specific rather than gestural.
The 30-minute buyer-side worksheet
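Run this against your current AI agency engagement. Score each marker zero (absent), one (partially present), or two (clearly present). The fire markers are reverse-scored: rows 1 to 6 list the healthy artifact whose absence is the fire signal, so a two on those rows counts toward keeping the agency and a zero counts toward firing it.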
| # | Marker | Where to look | Score 0–2 |
|---|---|---|---|
| 1 | Eval suite in your repo | evals/ directory, CI integration | |
| 2 | Senior engineer Slack-direct | Your Slack workspace, your standup attendance list | |
| 3 | PRs > decks by week 4 | GitHub PR count vs. shared drive deck count | |
| 4 | IP transfer language specific | MSA work-product clauses, milestone schedule | |
| 5 | Post-mortems in repo | docs/postmortems/ directory, count > 0 if engagement > 8 weeks | |
| 6 | Senior on commit graph | GitHub contributor graph, senior author at > 30% of meaningful commits | |
| 7 | Eval delta in PR descriptions | Last five PR titles or descriptions | |
| 8 | Named senior contributor | Your Slack DM list; can you name them | |
| 9 | Weekly Loom against real data | Repo links, drive folder, or video catalog | |
| 10 | Prompt registry versioned | prompts/ directory, git history | |
| 11 | Ugly post-mortems published | docs/postmortems/; at least one with a frank root cause | |
| 12 | Continuous IP transfer | Milestone-level work-product transfer in the MSA | |
A total score of 18 or higher out of a possible 24 across all twelve markers is the agency to keep. A zero on four or more of the fire markers (markers 1–6, where zero is the worst) is the agency to fire. Engagements that score in the middle are the ones to renegotiate; the conversation is “here are the four artifacts I expect in your next two weeks, scoped against this list,” and the agency’s response is the data point.
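The scoring can be sketched as a small function. This example rests on one reading of the thresholds, stated here as an assumption: 18 or more out of a possible 24 across all twelve rows means keep, and a zero on four or more of rows 1 to 6 means fire (the four-of-six rule from the FAQ). The function name `triage` and the constants are illustrative.

```python
# Assumed thresholds; see the lead-in for how they were read from the worksheet.
KEEP_TOTAL = 18       # out of a possible 24 across rows 1-12
FIRE_ZERO_ROWS = 4    # zeros on rows 1-6 that trigger "fire"


def triage(scores: dict[int, int]) -> str:
    """Return 'fire', 'keep', or 'renegotiate' for a completed worksheet.

    scores maps worksheet row number (1-12) to a score of 0, 1, or 2,
    where 2 always means the healthy artifact is clearly present.
    """
    if sorted(scores) != list(range(1, 13)):
        raise ValueError("expected scores for rows 1 through 12")
    if any(value not in (0, 1, 2) for value in scores.values()):
        raise ValueError("each score must be 0, 1, or 2")

    zeros_on_fire_rows = sum(1 for row in range(1, 7) if scores[row] == 0)
    total = sum(scores.values())

    if zeros_on_fire_rows >= FIRE_ZERO_ROWS:
        return "fire"
    if total >= KEEP_TOTAL:
        return "keep"
    return "renegotiate"
```

Under these thresholds the two outcomes cannot collide: four zeros on rows 1 to 6 cap the total at 16, below the keep line.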
The triage is not punitive. It is a way of cutting through the narrative an agency tells about itself and looking at the evidence in your own repo. Six months from now, the agencies that pass this worksheet will have shipped a system you can hand to an in-house team without losing anything. The ones that fail it will have produced a deck library and an invoice file. Knowing which is which now is worth more than waiting another quarter to find out.
Frequently Asked Questions
When should I fire my AI agency?
Fire the agency when at least four of six fire-side markers are clearly present after week four: no eval suite in your repo, account-manager-mediated technical communication, deck-first delivery with PR count below deck count, IP retention disputes in the MSA, missing or sanitized post-mortems, and junior-led delivery despite senior-fronted sales. Each marker is observable in your current engagement within a single working week, so the triage is fast and the cost of waiting another quarter is the cost of two more invoices plus continued absence of the eval baseline you should already have.
What is the single most diagnostic marker of a healthy AI agency engagement?
Eval-gated pull requests with the eval delta in every PR description. Open the last five PRs in your repo: if each one carries a line like ‘baseline 0.61, this PR 0.74, threshold 0.80, gap 0.06,’ the engagement is structured around evidence rather than opinion. CI runs the eval suite on every push, PRs that fail the gate do not merge until the gap closes, and new failure modes in production generate new eval cases. This single artifact is observable in 30 seconds and predicts almost every other healthy property of the engagement.
How do I tell if my AI agency is doing bait-and-switch staffing?
Look at the GitHub contributor graph after week three. The senior partner who attended the pitch and kickoff should be a meaningful committer: at 30 percent or more of meaningful commits, attending architecture reviews, and present in your Slack. If the day-to-day work is being done by an associate two years out of school while the senior partner appears for a 30-minute weekly check-in, you are paying for a senior and getting junior delivery. The healthy pattern is a named senior contributor at 50 percent or more allocation with juniors supporting under their review.
Why does the prompt registry need to live in my repo?
Because the prompt registry is the institutional memory of the AI system. Whoever owns it owns the system. A directory at prompts/ or src/prompts/ with every production prompt versioned, named, and changelogged, paired with the eval delta that justified each change, is the asset you take with you when you migrate to in-house or switch agencies. An agency that keeps the prompt registry on their internal infrastructure is structuring vendor lock-in by default, regardless of whether they intend it.
What does a healthy AI agency post-mortem look like?
A markdown file in your repo at docs/postmortems/2026-04-15-retrieval-drift.md with a timeline, a root cause, a remediation, and an eval case added to prevent recurrence. The ugly post-mortems (cost spikes, hallucinations that reached a user, eval thresholds that turned out wrong) are written up with the same rigor as the routine ones. Verbal post-mortems and ‘we caught it and fixed it, all good’ summaries do not count. Post-mortem culture is a ratchet: once an agency starts writing them, the system gets harder to break over time.
Is account-manager-mediated communication always a fire marker?
Account-managed scheduling and invoicing is fine; account-managed technical conversation is not. The healthy pattern is the senior engineer joining your Slack on day one, attending your standup, and pinging you when the eval delta on a PR is below threshold. The agency-side project manager handles logistics. If your contract specifically routes technical questions through an engagement director or delivery lead, the contract is structured to extract margin from communication friction, and every clarifying question becomes a 24-hour round trip that compounds into a feature shipped a month late.
What does continuous IP transfer mean in an AI agency contract?
The master services agreement names every category of work product (prompts, eval cases, synthetic data, fine-tuned weights, infrastructure code, observability config, documentation) and assigns ownership to the client at every milestone. There is no end-of-engagement transfer event because transfer is continuous. Each milestone payment releases a tagged artifact set into the client’s repo. The agency may retain a license to use anonymized aggregates for case studies, but the assets are the client’s from the moment they are produced. This is the inverse of the common ‘methodology retention’ pattern that locks clients in.
How is the 30-minute triage worksheet supposed to be scored?
Score each of twelve markers zero to two: zero if absent, one if partially present, two if clearly present. Markers 1 to 6 are the fire markers, reverse-scored so that each row names the healthy artifact: zero (artifact absent) is bad and two (artifact present) is good. Markers 7 to 12 are the keep markers, where presence is good. A total of 18 or higher out of a possible 24 across all twelve markers is the agency to keep. A zero on four or more of the fire markers is the agency to fire. Mid-scoring engagements are the ones to renegotiate, with a written list of four artifacts expected in the next two weeks.
Should I fire an agency that is great at strategy but slow on shipping?
If their deck count exceeds their PR count after week four, yes. Strategy without shipped code is a 2018 service model, and AI systems do not survive contact with production based on strategy. The healthy pattern produces a one-page architecture decision record in the repo on day six and uses decks as downstream artifacts to support stakeholder demos. If you genuinely need an AI strategy advisor, hire one separately at a fraction of the cost; do not pay an implementation agency to be your advisor while no implementation ships.
What should I do if my engagement scores in the middle of the worksheet?
Renegotiate rather than fire. Send the agency a written list of four artifacts you expect in the next two weeks: an eval suite committed to your repo with at least 20 ground-truth cases, the senior contributor’s GitHub username pinned in your Slack, a Loom demo against real data on Friday of week one, and an MSA addendum naming continuous IP transfer at each milestone. The agency’s response (speed, specificity, and willingness) is the data point. Agencies that meet the bar in two weeks are recoverable. Agencies that negotiate the bar are not.
Arthur Wandzel