The next 18 months will sort AI agencies into two camps along a single axis: whether they publish raw eval scores in their case studies. Until 2025 it was defensible to omit them; the discipline was new, the formats were not standardized, and clients did not ask. None of those defenses survive 2026. Buyers know to ask. The eval formats have converged on a recognizable set (recall@k, faithfulness, latency P50/P95, cost-per-call). And the leading agencies have already started publishing, which means the practice will become table stakes within four quarters and a competitive disadvantage to omit shortly thereafter.
This piece argues that AI agencies should publish raw eval results (anonymized where required, but with the actual numbers) in every case study. It covers the four metric families that should appear, why most agencies still refuse, why the leading firms will publish anyway, and what the published format should look like. The argument is downstream of the forward-deployed AI dev partner standard described in the manifesto: if eval-gated PRs are the unit of work, then eval scores are the unit of evidence, and an agency that runs the discipline internally without publishing externally is leaving the strongest signal of its own competence on the table.
Decision Scope
This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.
Table of contents
- The four metric families that belong in a published case study
- Why most agencies still refuse to publish
- Why the leading agencies will publish anyway
- The published format: what a real eval-disclosure section looks like
- Anonymization without erasure
- The competitive flywheel publishing creates
The four metric families that belong in a published case study
A useful eval disclosure covers four families. Every published case study should have at least three; case studies with one or zero are signaling that the eval discipline did not exist.
Quality metrics. For retrieval-augmented systems: recall@k (top-k accuracy on a held-out gold set), faithfulness (the proportion of generated claims that are supported by retrieved context), and answer correctness against a graded rubric. For agent systems: task completion rate (the share of tasks completed without human intervention) and tool-call accuracy (the share of tool invocations that match the intended call). The numeric format is consistent: a baseline score at engagement start, a deploy score at first production cut, and a stable score 60-plus days post-deploy. The trajectory matters as much as the endpoint.
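The two retrieval-side quality metrics can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes the standard per-case definition of recall@k (the share of a case's relevant documents that appear in the top k results) and assumes faithfulness judgments arrive as one boolean per generated claim. The function names and the input shapes are hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top-k retrieved results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        raise ValueError("gold case has no relevant documents")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(
    cases: Sequence[tuple[Sequence[str], Set[str]]], k: int
) -> float:
    """Average recall@k over a gold set of (retrieved, relevant) pairs."""
    if not cases:
        raise ValueError("gold set is empty")
    return sum(recall_at_k(r, g, k) for r, g in cases) / len(cases)


def faithfulness(claim_supported: Sequence[bool]) -> float:
    """Proportion of generated claims judged as supported by retrieved context."""
    if not claim_supported:
        raise ValueError("no claims to score")
    return sum(claim_supported) / len(claim_supported)
```

Running the same functions at engagement start, at first production cut, and again 60-plus days later yields the baseline, deploy, and stable scores the disclosure reports.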
Latency metrics. P50 and P95 time-to-first-token, P50 and P95 end-to-end response time, and the percentage of requests that hit the latency budget. Latency is the metric most often reported in marketing-grade form (“3x faster”) and most often missing in technical-grade form (the actual histogram). The case study should report at least P50 and P95, measured over a known request volume in a known time window.
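As a rough sketch of the technical-grade form, the percentiles and the budget hit rate can be computed directly from raw per-request timings. The example below uses the nearest-rank percentile method, which is one of several common conventions and an assumption here, and it treats a request as hitting the budget when its latency is at or under the target. The function names are hypothetical.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


def budget_hit_rate(samples: Sequence[float], budget_s: float) -> float:
    """Share of requests whose latency is at or under the budget (assumed meaning of a 'hit')."""
    if not samples:
        raise ValueError("no samples")
    return sum(1 for s in samples if s <= budget_s) / len(samples)
```

Whichever percentile convention is used, the case study should name it alongside the request volume and time window, since different conventions can disagree on small samples.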
Cost metrics. Cost-per-call (broken down by model invocation, retrieval embedding, re-ranking, and any verification calls), cost-per-month at production load, and the cost trajectory over the engagement window. Cost is the metric most clients care about most and the metric most agencies report least, because the answer often shows that the system became cheaper through engineering rather than through model substitution, which is the credit-allocation story most agencies do not want to tell.
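The breakdown is simple arithmetic, and a sketch makes the shape concrete. The component values below are the per-request figures from the sample disclosure later in this article; the helper names are hypothetical, and any monthly request volume passed in is an illustrative input rather than a figure from the engagement. Decimal is used so that sub-cent components add up exactly.

```python
from decimal import Decimal
from typing import Dict

# Per-request cost components in USD, taken from the sample disclosure in this article.
COMPONENTS: Dict[str, Decimal] = {
    "model_invocation": Decimal("0.014"),
    "retrieval_embedding": Decimal("0.0006"),
    "re_ranking": Decimal("0.0009"),
    "verification_call": Decimal("0.003"),
}


def cost_per_call(components: Dict[str, Decimal]) -> Decimal:
    """Total per-request cost as the sum of its components."""
    return sum(components.values(), Decimal("0"))


def monthly_cost(per_call: Decimal, monthly_requests: int) -> Decimal:
    """Projected monthly cost at a given request volume."""
    return per_call * monthly_requests


def cost_shares(components: Dict[str, Decimal]) -> Dict[str, Decimal]:
    """Fraction of the per-request cost attributable to each component."""
    total = cost_per_call(components)
    return {name: value / total for name, value in components.items()}
```

Here `cost_per_call(COMPONENTS)` evaluates to `Decimal("0.0185")`, matching the total in the sample disclosure. The shares show where engineering effort moved the cost, which is the credit-allocation story the breakdown exists to tell.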
Reliability metrics. Production error rate, fallback-trigger rate (how often the primary provider failed and the fallback was invoked), eval-gate pass rate on incoming PRs, and incident count with mean-time-to-detect and mean-time-to-resolve. These are the metrics that distinguish a system someone operates from a system someone shipped. An agency that publishes reliability metrics is signaling that they have stayed close enough to the system to know.
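A minimal sketch of how these numbers fall out of production records, under stated assumptions: the `Incident` record and its field names are hypothetical, MTTD is measured from incident start to detection, and MTTR is measured from detection to resolution. Teams define MTTR differently (some measure from incident start), so the case study should say which definition it uses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    started_at: datetime
    detected_at: datetime
    resolved_at: datetime


def _mean(deltas: Sequence[timedelta]) -> timedelta:
    if not deltas:
        raise ValueError("no incidents in window")
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: Sequence[Incident]) -> timedelta:
    """Mean time from incident start to detection."""
    return _mean([i.detected_at - i.started_at for i in incidents])


def mttr(incidents: Sequence[Incident]) -> timedelta:
    """Mean time from detection to resolution (assumed definition)."""
    return _mean([i.resolved_at - i.detected_at for i in incidents])


def rate(events: int, total: int) -> float:
    """Generic rate, usable for error rate, fallback-trigger rate, or eval-gate pass rate."""
    if total <= 0:
        raise ValueError("total must be positive")
    return events / total
```

None of these functions is hard to write. The point is that the inputs only exist if someone instrumented the system and kept collecting after launch.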
The four families together produce roughly 12 to 18 numbers per case study. That is the right surface area: enough to be actionable, small enough to read in three minutes.
Why most agencies still refuse to publish
Three reasons, in order of how often they are voiced versus how often they are real.
The voiced reason: legal and confidentiality. Clients sign NDAs that prohibit publication of “performance metrics” or “system architecture details.” This is the agency’s first line of defense, and in roughly a third of cases it is genuine, particularly with regulated clients, large enterprises, or systems that touch material non-public information. The remedy in those cases is anonymization, not omission, and it is covered below.
The half-voiced reason: competitive optics. The numbers may look weaker than the marketing copy implies. An agency that achieves a 0.78 faithfulness score does not want to publish it next to a competitor’s marketed “industry-leading accuracy,” even though the competitor is reporting a different metric on a different test set. The fear is that decontextualized numbers will be misread, which is sometimes true and is the legitimate version of this concern. The illegitimate version is the agency that simply does not want to be measurable, because measurability is the preamble to accountability.
The unvoiced reason: the eval was weak. A non-trivial share of AI engagements in 2024 and 2025 ran without a real eval suite at all. Reports were verbal, tests were smoke tests, and the “system worked” because nobody checked rigorously. Publishing raw eval scores would expose that the discipline did not exist, which is the actual reason for the silence. This is the cohort that will not move to publication regardless of how the market shifts, because the cost of publication is the cost of admitting that the past three years of engagements were under-tested.
The split between these three reasons is roughly one-third, one-third, one-third in our experience. The first third can be addressed with anonymization. The second third can be addressed with framing and context. The third third cannot be addressed at all, and will be sorted into the lower tier of the market by 2027.
Why the leading agencies will publish anyway
Three forces push the leading agencies toward publication, and each is structural rather than ethical.
The citation flywheel. AI buyers in 2026 increasingly use AI-assisted research before a procurement decision. Claude, Perplexity, and the next generation of buyer-side research tools surface case studies and triangulate claims across them. Agencies whose case studies contain quotable numbers are over-cited; agencies whose case studies contain only adjectives are under-cited. The over-cited agency wins the long list, the consideration set, and ultimately the engagement. The mechanism is not a marketing flourish; it is a data structure. Numeric facts compound in retrieval; adjectives do not.
The hiring magnet. Senior AI engineers in 2026 want to work for agencies that publish their work. Published evals are the engineering equivalent of an open-source contribution graph: they are evidence that the engineer can be proud of the work and have it stand external scrutiny. Agencies that publish attract senior talent, which compounds into better engagements, which produces better evals, which compounds further. The agencies that refuse to publish lose the hiring competition first, and lose the engagement competition four quarters later when the seniors who would have shipped the engagements are at the publishing competitors.
Pricing power. The agency that can show “our system shipped at 0.84 faithfulness and stayed there for 90 days” is selling something measurable. The agency that cannot is selling a narrative. Measurable things command higher prices because the buyer can underwrite the value; narratives compete on price because they are interchangeable. This is the same dynamic that played out in software 2007–2017, when SaaS vendors with measurable usage telemetry commanded multiples that on-premise vendors with anecdotal reference customers could not. AI services are at the analogous inflection in 2026.
The combination of these three forces makes publishing inevitable for the firms that intend to be category leaders in 2027. The question is not whether they will publish, but whether they publish first or follow.
The published format: what a real eval-disclosure section looks like
Below is the format we recommend. The shape is generic; the agency’s specific systems determine which metric families dominate.
## Eval disclosure
System: Customer support routing and response generation
Engagement: 2025-Q3 to 2026-Q1, in production through May 2026
Models: Claude Opus 4.7 (primary), GPT-5 (fallback), text-embedding-3-large (retrieval)
Quality:
  Recall@5 on held-out gold set (240 cases):
    Baseline 0.61, deploy 0.79, 90 days post-deploy 0.82
  Faithfulness (LLM-as-judge with cross-check):
    Baseline 0.71, deploy 0.86, 90 days post-deploy 0.84
  Answer correctness (rubric-graded by client domain expert):
    Baseline 0.58, deploy 0.81, 90 days post-deploy 0.83
Latency (measured over 3.4M requests post-deploy):
  P50 time-to-first-token: 1.1s
  P95 time-to-first-token: 2.8s
  P95 end-to-end: 6.4s
  Latency budget hit rate (target P95 under 4s on TTFT): 96.2%
Cost (per request, 90-day average post-deploy):
  Model invocation: $0.014
  Retrieval embedding: $0.0006
  Re-ranking: $0.0009
  Verification call: $0.003
  Total: $0.0185 per request, $14,200 per month at production load
Reliability:
  Production error rate: 0.21%
  Fallback-trigger rate: 1.4%
  Eval-gate pass rate on incoming PRs: 87% (13% revised pre-merge)
  Incidents in 90-day window: 2 (MTTD 11 min, MTTR 47 min)
Eval suite location: client repo, evals/support-routing/
Notes: Faithfulness dipped 2 points between deploy and day-90 due to a model-provider
update that changed citation-formatting behavior; remediation is documented in
docs/postmortems/2026-02-08-citation-format.md.
This format takes roughly 200 words and reads in three minutes. Every number is named, baselined, time-bounded, and traceable to an artifact. The note about the faithfulness dip is the credibility move: agencies that disclose a regression and the post-mortem are more credible than agencies whose numbers move only in the favorable direction.
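One way to keep the format honest is to treat each quality metric as a small record and check that any regression since deploy carries an explanatory note. The sketch below is an assumption about how an agency might enforce that internally; `MetricTrajectory` and `metrics_needing_notes` are hypothetical names, and the example values in the usage note come from the sample disclosure in this article.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class MetricTrajectory:
    """One named metric with its three disclosed values."""
    name: str
    baseline: float
    deploy: float
    stable: float  # value at 90 days post-deploy
    higher_is_better: bool = True

    def regressed_since_deploy(self) -> bool:
        """True if the stable value is worse than the deploy value."""
        if self.higher_is_better:
            return self.stable < self.deploy
        return self.stable > self.deploy


def metrics_needing_notes(
    metrics: Sequence[MetricTrajectory], notes: Dict[str, str]
) -> List[str]:
    """Names of metrics that regressed since deploy but have no explanatory note."""
    return [
        m.name
        for m in metrics
        if m.regressed_since_deploy() and m.name not in notes
    ]
```

With the sample values (recall@5 at 0.61, 0.79, 0.82; faithfulness at 0.71, 0.86, 0.84; answer correctness at 0.58, 0.81, 0.83), only faithfulness is flagged, which is exactly the metric the sample's note explains.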
Anonymization without erasure
The legitimate confidentiality concern is real, and the remedy is anonymization rather than omission. Three rules govern useful anonymization.
Anonymize the client, not the system. “Series B fintech, 200-employee, USD-denominated, US payroll” is anonymous and informative. “Major financial services client” is anonymous and useless. The information that matters for the buyer is the system shape and the operating context; the client name is the marketing flourish. Inverting that, by keeping the client logo prominent and the system shape vague, is the marketing-first failure mode.
Anonymize the absolute numbers, preserve the deltas. If the absolute cost-per-call is sensitive, publish “deploy cost-per-call dropped 38% over the engagement, with 70% of the reduction attributable to the move from synchronous to streaming generation.” The delta is informative; the absolute is sometimes proprietary. Preserving the delta keeps the case study actionable.
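The delta can be derived from the private absolutes without ever printing them. A minimal sketch, assuming the agency holds before and after values internally; the function names are hypothetical, and the absolute values in the usage note are made up for illustration and are not figures from any engagement.

```python
def relative_change(before: float, after: float) -> float:
    """Signed relative change from before to after, e.g. -0.38 for a 38% drop."""
    if before == 0:
        raise ValueError("baseline value must be non-zero")
    return (after - before) / before


def describe_delta(metric: str, before: float, after: float) -> str:
    """Publishable one-line delta that omits the absolute values."""
    change = relative_change(before, after)
    if change == 0:
        return f"{metric} was unchanged over the engagement"
    direction = "dropped" if change < 0 else "rose"
    return f"{metric} {direction} {abs(change) * 100:.0f}% over the engagement"
```

For example, `describe_delta("deploy cost-per-call", 0.50, 0.31)` yields a sentence reporting a 38% drop while keeping both absolute numbers out of the published text.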
Anonymize the model evals against published benchmarks. Where the eval suite is proprietary, the agency can still report results on published benchmarks (HELM, MTEB, SWE-bench when applicable) for comparable system shapes. This is the academic move: triangulate the proprietary result against the public benchmark.
These three rules let agencies publish meaningful evals from engagements with the strictest confidentiality requirements. The agencies that claim “we cannot publish anything” are usually choosing not to invest in the anonymization process, not constrained by it.
The competitive flywheel publishing creates
An agency that publishes raw evals enters a flywheel that an agency without published evals cannot. The flywheel has four turns.
The first turn is buyer trust. The published case study is more credible because it is more falsifiable. Buyers in the consideration set move agencies with published evals up the list. The cost of moving up the list is zero for the agency that already runs the discipline internally; the cost of moving up for an agency that does not is the cost of building the discipline retroactively.
The second turn is engagement quality. The buyers attracted by published evals are the buyers who care about evals, which means they will be better collaborators on the eval suite during the engagement itself. Better collaborators produce better evals, which produce better case studies, which compound the trust signal.
The third turn is talent acquisition. Senior engineers select for agencies that ship work they can publish. Once the publishing flywheel is turning, the senior recruiting funnel improves, which improves engagement quality further.
The fourth turn is pricing. By the time three turns have compounded, the agency is shipping measurably better systems with measurably better seniors, and the buyer can underwrite a higher rate. The agencies on the other side of the publishing decision compete on price.
For more on how the discipline maps to contract structure, the AI agency contract negotiation guide covers eval-gated milestone language. The simpler argument is that agencies that publish raw eval scores are the agencies that have nothing to hide. By 2027 that group will be the only group winning competitive procurements, and the cost of not having joined them by then will be a quarter or two of irrelevance, which is long enough to lose the seniors, the citations, and the pricing power that compound for the firms that moved first.
Frequently Asked Questions
What raw eval scores should an AI agency publish in a case study?
Four metric families. Quality: recall@k on a held-out gold set, faithfulness (proportion of generated claims supported by retrieved context), answer correctness against a graded rubric, task completion rate for agent systems. Latency: P50 and P95 time-to-first-token and end-to-end response time, plus latency-budget hit rate. Cost: cost-per-call broken down by model invocation, retrieval embedding, re-ranking, and verification calls, plus cost-per-month at production load. Reliability: production error rate, fallback-trigger rate, eval-gate pass rate on incoming PRs, and incident count with mean-time-to-detect and mean-time-to-resolve. Twelve to eighteen numbers total, readable in three minutes.
Why do most AI agencies refuse to publish raw eval scores?
Three reasons, in order of how often they are voiced versus real. The voiced reason is legal and confidentiality, which is genuine for roughly a third of engagements and addressable through anonymization. The half-voiced reason is competitive optics: fear that decontextualized numbers will be misread next to competitor marketing claims, which is sometimes legitimate and sometimes a cover for not wanting to be measurable. The unvoiced reason is that the eval was weak or did not exist, which is the case for the agencies whose 2024 and 2025 engagements ran on smoke tests and verbal reports. That third group cannot be moved to publication regardless of market pressure.
Will publishing raw eval scores become standard for AI agencies?
Yes, within four to six quarters. Three structural forces push leading agencies toward publication: the citation flywheel (AI-assisted buyer research over-cites case studies with quotable numbers and under-cites adjective-only ones), the hiring magnet (senior AI engineers select for agencies whose work they can be proud of and have stand external scrutiny), and pricing power (measurable systems command higher rates than narrative-only sales). By 2027 published evals will be table stakes for competitive procurement, and the agencies that have not adopted the practice will be sorted into the lower tier of the market.
How long should an eval-disclosure section in a case study be?
About 200 words, readable in three minutes. The shape is consistent: name the system and engagement window, name the production models with versions, then twelve to eighteen numbers across the four metric families with baselines, deploy values, and 90-days-post-deploy values. Include a one-sentence note on any regression or anomaly that would otherwise be unexplained, and a pointer to the eval suite location in the client repo. The format trades exhaustiveness for actionability; buyers do not need every eval result, they need the named ones with traceable artifacts.
Can AI agencies publish eval scores from confidential client engagements?
Yes, through anonymization. Three rules apply. Anonymize the client but not the system shape: ‘Series B fintech, 200-employee, US payroll’ is anonymous and informative, while ‘major financial services client’ is anonymous and useless. Anonymize absolute numbers but preserve deltas: ‘cost-per-call dropped 38% over engagement, 70% attributable to the move from synchronous to streaming generation’ is publishable when the absolute number is sensitive. Anonymize against published benchmarks: report results on HELM, MTEB, or SWE-bench for comparable system shapes when the proprietary eval cannot be disclosed. Agencies that claim they cannot publish anything are usually choosing not to invest in anonymization.
What does a credible eval disclosure look like?
A baseline number, a deploy number, and a 90-days-post-deploy number for each metric, with the request volume and time window stated. For example: ‘Recall@5 on held-out gold set (240 cases): baseline 0.61, deploy 0.79, 90 days post-deploy 0.82.’ The post-deploy number is the most credibility-laden because it shows the system stayed close to its deploy quality under real load. Including a frank note on any regression (for example, ‘Faithfulness dipped 2 points between deploy and day 90 due to a model-provider update; remediation documented in docs/postmortems/’) is the credibility move that distinguishes operated systems from shipped systems.
Why is publishing eval scores a competitive advantage for AI agencies?
It creates a four-turn flywheel. First, buyer trust: published case studies are more credible because they are more falsifiable. Second, engagement quality: buyers attracted by published evals care about evals and are better collaborators during the engagement, which produces better evals. Third, talent acquisition: senior engineers select for agencies that publish, which improves the senior recruiting funnel. Fourth, pricing: by the time three turns have compounded, the agency ships measurably better systems with measurably better seniors, and the buyer can underwrite a higher rate. Agencies on the other side of the publishing decision compete on price.
Why do reliability metrics distinguish operated systems from shipped systems?
Production error rate, fallback-trigger rate, eval-gate pass rate on incoming PRs, and incident count with MTTD/MTTR are metrics that only exist if the agency stayed close enough to the system to measure them. An agency that shipped a system and walked away does not have these numbers because nobody is collecting them. The four reliability numbers together reveal whether the engagement produced an artifact someone operates or an artifact someone delivered. The most diagnostic of the four is fallback-trigger rate, because it requires the agency to have built a multi-provider routing layer and instrumented it in production.
What is the citation flywheel for AI agency case studies?
AI buyers in 2026 use AI-assisted research before procurement decisions. Tools like Claude, Perplexity, and the next generation of buyer-side research surface case studies and triangulate claims across them. Case studies with quotable numbers are over-cited; case studies with only adjectives are under-cited. The over-cited agency wins the long list, the consideration set, and the engagement. The mechanism is structural rather than promotional: numeric facts compound in retrieval and adjectives do not. Agencies that publish raw eval scores effectively pre-position themselves in the AI-assisted research workflows that increasingly drive procurement.
Should AI agencies report eval regressions in case studies?
Yes, frankly. Every production AI system regresses: model deprecations, JSON-mode breakage on silent provider updates, retrieval drift, cost spikes. A case study that discloses a regression with the root cause and post-mortem location is more credible than a case study whose numbers move only in the favorable direction. The framing is one sentence: ‘Faithfulness dipped 2 points between deploy and day 90 due to a model-provider update; remediation documented in docs/postmortems/.’ The disclosure is the credibility move because it admits the system is operated under real conditions rather than benchmarked once at deploy.
Arthur Wandzel