Most AI agency case studies in 2026 cannot survive 15 minutes of structured questioning. They are written by marketing teams who interviewed an account director, optimized for the hero metric (“45% faster” or “3x throughput”), and signed off by a partner-level executive who rarely spoke to the engineer who shipped the work. That does not mean the agency is fraudulent; it means the case study is a sales document, and reading it as evidence of capability is a category error. The remedy is not to ignore case studies, which remain the single best surface for triaging agency claims at the long-list stage, but to read them with a structured eye for which elements are real and which are marketing.
This piece decodes eight elements of a typical AI agency case study and tells you what to look for in each. Each element has a “real” version (what the agency did, attested) and a “marketing” version (what the case study implies, unattested). Most case studies mix the two; the decoding job is to separate them. By the end you should have a list of seven to nine specific questions to put to the agency before progressing them to a paid evaluation, the kind described in the field guide to evaluating an AI agency. The framing throughout is the forward-deployed AI dev partner standard described in the manifesto: if the case study cannot back up the manifesto-grade claims, the agency probably cannot either.
Decision Scope
This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.
Table of contents
- Why case study decoding matters in 2026
- Element 1: the client logo
- Element 2: the headline metric claim
- Element 3: eval methodology disclosure
- Element 4: model and version named
- Element 5: the time period stated
- Element 6: regression handling
- Element 7: post-launch outcome
- Element 8: agency vs client team contribution
- The seven questions to put to the agency
Why case study decoding matters in 2026
A case study is the most concentrated unit of evidence an AI agency provides about itself. It is also the artifact most distorted by marketing. The distortion is not unique to AI (every services category has it), but the AI version has a specific shape because the substrate is non-deterministic, the evals are non-trivial, and the “47% improvement” claim is technically meaningless without context. An agency can ship a system that improves a synthetic benchmark by 47 percent and degrades the production user experience; the case study can truthfully claim the benchmark improvement and never mention the degradation. This is not lying; it is selective truth-telling, and decoding it is the buyer’s job.
The other reason decoding matters is that the case study is a sample of the agency’s writing discipline. An agency that publishes case studies with named eval methodology, frank regression handling, and explicit time bounds is signaling that they write the same way internally. An agency whose case studies are gauzy and metric-only is telling you what their internal documentation looks like, which is a strong predictor of how the engagement will be run.
Element 1: the client logo
The first thing on the page is a client logo. The decoding question is whether the logo is real-and-permitted, real-but-licensed, or pure marketing.
Real and permitted means the named client gave written permission to use their logo, with quotation rights, in the context of the case study. The logo appears with a real client title attribution (“Director of Engineering, Stripe”) and the client may participate in reference calls. This is the strongest version.
Real but licensed means the logo appears under a generic “trusted by” arrangement that does not imply specific work. The agency may have run a procurement workshop, a discovery sprint, or a single small project, and the logo is licensed for general marketing use without the right to attribute specific outcomes. This is permitted but weaker; the logo on the case study page does not necessarily mean the agency built the system being described.
Marketing means the logo is on the page but is not directly tied to the work being described. The case study reads as if the work was for that client, but the actual paragraph never says so explicitly. Lawyers wrote the page to be defensible, not informative. The buyer signal here is the absence of a named, quotable client contact.
The question to ask: “May I speak to a named contributor at the client side of this engagement, on a 30-minute call, this week?” An agency that can produce that contact within five business days is in the real-and-permitted bucket. An agency that demurs (“the client is sensitive to outreach”) is somewhere in the other two.
Element 2: the headline metric claim
The headline metric (“45% faster,” “3x throughput,” “$2M in cost savings”) is the most marketing-distorted element of the case study. The decoding question has three parts: what was measured, against what baseline, and over what period.
A real metric reads: “Median time-to-first-token on the customer support inbox flow dropped from 4.2s to 2.3s after the routing layer was migrated to LiteLLM with provider-aware fallback, measured over the 60 days following deploy across 3.4M requests.” The metric is named, the baseline is named, the time bound is named, and the measurement infrastructure is implied (P50 latency over a known volume).
A marketing metric reads: “45% faster customer support.” The reader fills in the gaps: faster than what, for whom, measured how, sustained for how long. None of those gaps are filled. The metric may be technically true (some measurement somewhere produced that number), but it is not actionable evidence of capability. The marketing metric is also resistant to verification, because the reader does not know what to verify.
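The arithmetic behind a headline percentage is simple once the baseline and post-deploy figures are named. The following is a minimal sketch, using the 4.2s and 2.3s medians from the example above; the function names and the sample lists are illustrative assumptions, not anything an agency is known to use.

```python
from statistics import median

def percent_reduction(baseline: float, post: float) -> float:
    """Relative reduction from baseline to post, as a percentage."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - post) / baseline * 100.0

def median_latency_change(baseline_samples, post_samples):
    """Compare P50 latency between a baseline window and a post-deploy window."""
    before = median(baseline_samples)
    after = median(post_samples)
    return before, after, percent_reduction(before, after)

# Figures from the example: median time-to-first-token 4.2s -> 2.3s.
headline = percent_reduction(4.2, 2.3)
print(f"{headline:.1f}% reduction")

# Hypothetical per-request samples; a real window would hold millions.
print(median_latency_change([4.0, 4.2, 4.4], [2.2, 2.3, 2.4]))
```

The same number can be reached from very different measurements, which is why the named baseline, window, and volume carry more evidence than the percentage itself.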
The question to ask: “What was the baseline metric, what was the post-deploy metric, and over what time window were both measured?” An agency that answers in 30 seconds with named measurements is real. An agency that needs a follow-up email is marketing.
Element 3: eval methodology disclosure
The eval methodology is the part of the case study that separates 2026-grade agencies from 2024-grade agencies. The 2026-grade case study describes the eval suite specifically: “We built a 240-case eval set drawn from production support tickets, with three pass conditions per case (factual correctness, tone match, citation presence), measured at deploy and weekly thereafter. The eval suite is committed to the client’s repo at evals/support-routing/.”
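To make the three pass conditions concrete, here is a minimal sketch of how per-case results might be recorded and summarized. The condition names come from the example above; the class, the all-three-must-hold pass rule, and the ticket IDs are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class CaseResult:
    case_id: str
    factual_correctness: bool
    tone_match: bool
    citation_presence: bool

    @property
    def passed(self) -> bool:
        # Assumption: a case passes only when all three conditions hold.
        return self.factual_correctness and self.tone_match and self.citation_presence

def summarize(results: list) -> dict:
    """Overall pass rate plus a per-condition breakdown."""
    if not results:
        raise ValueError("eval set is empty")
    n = len(results)
    return {
        "cases": n,
        "pass_rate": sum(r.passed for r in results) / n,
        "factual_correctness": sum(r.factual_correctness for r in results) / n,
        "tone_match": sum(r.tone_match for r in results) / n,
        "citation_presence": sum(r.citation_presence for r in results) / n,
    }

# Hypothetical results for four cases; a real suite would have hundreds.
results = [
    CaseResult("ticket-001", True, True, True),
    CaseResult("ticket-002", True, False, True),
    CaseResult("ticket-003", True, True, False),
    CaseResult("ticket-004", True, True, True),
]
report = summarize(results)
print(report)
```

A per-condition breakdown like this is what lets a reader ask which condition is failing, rather than accepting a single accuracy figure.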
The marketing version handwaves: “We rigorously tested the system.” The reader cannot tell whether “rigorous” means a 12-case smoke test or a 1,000-case golden set. Worse, the absence of methodology disclosure usually correlates with the absence of methodology; agencies that wrote a real eval suite tend to want to talk about it, because it is the artifact they are proud of.
The third version, common in 2024 case studies and increasingly disqualifying in 2026, is metric-without-methodology: “We achieved 92% accuracy.” Accuracy on what test set, with what scoring, is unanswered. The number is unfalsifiable, which is the marketing tell.
The question to ask: “Where did the eval set come from, how many cases, what were the pass conditions, and is the suite still running in CI?” Real agencies will pull up the directory in their git history. Marketing agencies will offer to “send over the methodology document.”
Element 4: model and version named
Specific models, named with version, are a marker of technical seriousness. “Claude Opus 4.7” is a real claim. “Latest large model” or “GPT-class system” is a marketing claim. The version matters because model behavior changes between versions in ways that invalidate prior evals; an agency that names the model is signaling that they ran the evals on that specific version and would re-run them if the version changed.
The case study should also name the routing layer, the embedding model (if retrieval is involved), and the inference provider. A claim like “we used a state-of-the-art retrieval-augmented generation pipeline” is a marketing claim. A claim like “we used text-embedding-3-large for chunk embeddings, Qdrant for vector storage, and Cohere Rerank 3.5 for re-ranking, with cross-encoder verification on the top-10 results” is a technical claim that can be verified or challenged.
The question to ask: “What models were in production at the end of the engagement, named with version, and what was the cost-per-request profile?” An agency that cannot name the production stack precisely is signaling that they are not the team that runs the system, which means whatever they did has not been operationalized.
Element 5: the time period stated
A case study with no dates is a case study that did not happen, or happened so long ago that the work is not relevant. AI tooling moves quarterly; a system architected in 2024 would be wrong in 2026 in specific ways (no MCP, single-provider lock-in, no cost-per-call observability). The case study should name the engagement window: “engagement Q3 2025, system shipped November 2025, in production through May 2026.”
A real case study will name three dates: engagement start, system shipped, last verified in production. The third date matters because production-in-2025 systems that have not been touched in 2026 are typically broken, because at least three model deprecations and two API changes will have happened in the intervening months.
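The three-date check can be written down as a small triage routine. This is a sketch under stated assumptions: the field names are invented here, the specific days are hypothetical, and the 180-day staleness threshold is an illustrative choice rather than a figure from this article.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class CaseStudyDates:
    engagement_start: Optional[date]
    system_shipped: Optional[date]
    last_verified_in_production: Optional[date]

def staleness_flags(d: CaseStudyDates, today: date, max_age_days: int = 180) -> list:
    """List missing dates, and flag a last-verified date older than max_age_days."""
    flags = []
    for name in ("engagement_start", "system_shipped", "last_verified_in_production"):
        if getattr(d, name) is None:
            flags.append(f"missing: {name}")
    last = d.last_verified_in_production
    if last is not None and (today - last).days > max_age_days:
        flags.append("stale: last verified more than "
                     f"{max_age_days} days ago")
    return flags

# Hypothetical dates for a dated case study and a partly undated one.
dated = CaseStudyDates(date(2025, 7, 1), date(2025, 11, 15), date(2026, 5, 1))
undated = CaseStudyDates(date(2024, 3, 1), None, date(2025, 6, 1))
print(staleness_flags(dated, today=date(2026, 6, 1)))
print(staleness_flags(undated, today=date(2026, 6, 1)))
```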
The marketing version omits dates entirely, or quotes only the engagement start (“we partnered with Acme in 2024”). The omission is structural: dates make case studies expire, and marketing teams optimize against expiration. The buyer should treat undated case studies as automatically stale.
The question to ask: “When did the system last get a production code change, and what was that change?” An agency with an active engagement will have an answer from this week. An agency with a stale case study will need to check.
Element 6: regression handling
Every AI system in production regresses. The model provider deprecates an endpoint. A new model version changes outputs. Retrieval drift accumulates. Cost spikes after a silent change in upstream pricing. The case study that omits regression handling is the case study that has never been pressure-tested against reality.
A real case study includes at least one regression story. “In month four, the upstream provider shipped a model update that broke the JSON-mode adherence on our routing layer. Detection was via the eval suite (faithfulness dropped from 0.91 to 0.74 in 14 hours), remediation was a temporary pin to the prior version while we updated the structured output handler, and the post-mortem is in the client’s repo.” This story is short, specific, and credible because it admits a failure.
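Detection via an eval suite, as in the story above, usually reduces to comparing each run's score against a baseline. A minimal sketch follows, using the 0.91 and 0.74 faithfulness figures from the example; the 0.05 alert threshold, the intermediate scores, and the function name are assumptions.

```python
from typing import Optional, Sequence

def first_regression(scores: Sequence[float], baseline: float,
                     max_drop: float = 0.05) -> Optional[int]:
    """Index of the first eval run that falls more than max_drop below baseline."""
    for i, score in enumerate(scores):
        if baseline - score > max_drop:
            return i
    return None

# Hypothetical sequence of eval runs ending in the drop described above.
faithfulness_runs = [0.91, 0.90, 0.92, 0.74]
hit = first_regression(faithfulness_runs, baseline=0.91)
print(hit)  # index of the run that should trigger an alert
```

An agency that runs something like this on a schedule can state when the regression was detected; one that does not can only say when a user complained.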
The marketing version is silent on regressions. The implication is that the system has been smooth since deploy, which is implausible at the 12-month horizon and impossible at the 24-month horizon. The buyer who reads regression silence as good news is misreading the document; regression silence means the agency does not write post-mortems, which is the marker covered in the AI agency contract negotiation guide under post-launch obligations.
The question to ask: “Walk me through the worst regression on this system in the last six months, and where the post-mortem lives.” Agencies that have one will have it ready. Agencies that do not will pivot to a story about a different client.
Element 7: post-launch outcome
The case study covers the build. The decoding question is what happened after the build. Post-launch is where most AI engagements decay, because the agency rolls off, the client team is unprepared to operate the system, and the evals stop running because nobody owns them.
A real post-launch section names the operational handoff: who owns the system now, what runbooks exist, what the on-call rotation is, and what the cost trajectory has been. A typical real version reads: “System is now operated by the client’s platform team (3 engineers), with the agency on retainer for monthly evals review and architecture consultations. Monthly inference cost has been within 8% of the budgeted ceiling for four consecutive months.”
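A claim such as "within 8% of the budgeted ceiling for four consecutive months" is checkable if the monthly figures exist. The sketch below assumes "within 8%" means the absolute deviation from the ceiling is at most 8 percent of the ceiling; the cost figures and the function name are hypothetical.

```python
def months_within_budget(monthly_costs, ceiling: float, tolerance: float = 0.08) -> int:
    """Count the most recent consecutive months whose cost is within tolerance of ceiling."""
    if ceiling <= 0:
        raise ValueError("ceiling must be positive")
    streak = 0
    for cost in reversed(monthly_costs):
        if abs(cost - ceiling) / ceiling <= tolerance:
            streak += 1
        else:
            break
    return streak

# Hypothetical monthly inference costs, oldest first, against a 10,000 ceiling.
costs = [13000, 10500, 9800, 10200, 10700]
streak = months_within_budget(costs, ceiling=10000)
print(streak)
```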
The marketing version stops at deploy. The buyer is left to assume the system is still working, which is not an assumption that survives 12 months of production AI. The other marketing tell is the “ongoing partnership” formulation, which is shorthand for “we are still billing them” and is a statement about the agency’s business, not the system’s health.
The question to ask: “What is the cost-per-request and cost-per-month trajectory for this system over the last six months?” This question separates agencies that operationalized the system from agencies that shipped it and walked away. The number itself matters less than whether they have it.
Element 8: agency vs client team contribution
The most subtle element is the contribution split. A case study that says “we built X” is implicitly claiming the agency is the system’s author. In practice the work is almost always a collaboration: the agency wrote the routing layer, the client team wrote the integration with their own data systems, the agency wrote the eval scaffolding, the client domain expert wrote the eval cases, and so on. Naming the split is a marker of intellectual honesty; eliding it is marketing.
A real case study reads: “The agency designed the architecture and wrote the model-routing layer, observability, and eval scaffolding. The client engineering team wrote the integration with their internal CRM and the human-in-the-loop review interface. The client domain expert authored 187 of the 240 eval cases. Production operation is shared, with on-call alternating weekly.”
The marketing version uses “we” throughout and never distinguishes. The risk to the buyer is hiring an agency that cannot do the work without a strong client team, and then discovering that their own team is not strong enough to be the missing half. The split also matters for IP; an agency that wrote 30 percent of the codebase has a different claim on the work product than an agency that wrote 90 percent.
The question to ask: “What percentage of the production codebase did your team author, by line count?” The number is not the point; the agency’s ability to estimate it credibly is.
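Estimating the split by line count is simple arithmetic once per-team line counts are available. This is a sketch only: how the counts are gathered is not specified in this article, and the team labels and numbers are hypothetical.

```python
def authorship_share(lines_by_team: dict) -> dict:
    """Convert per-team line counts into percentage shares of the codebase."""
    total = sum(lines_by_team.values())
    if total <= 0:
        raise ValueError("no lines counted")
    return {team: 100.0 * lines / total for team, lines in lines_by_team.items()}

# Hypothetical counts for a 10,000-line production codebase.
shares = authorship_share({"agency": 3000, "client": 7000})
print(shares)
```

Line count is a crude proxy for contribution, which is consistent with the point above: the estimate matters less than whether the agency can produce one credibly.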
The seven questions to put to the agency
Compile the questions raised by each element into a single email or call agenda. Send them to the agency before progressing to a paid evaluation. Score the responses on speed (within five business days is fast, beyond ten is concerning) and specificity (named artifacts beat narrative, named numbers beat adjectives).
- May I speak to a named contributor at the client side of this engagement on a 30-minute call this week?
- What was the baseline metric, what was the post-deploy metric, and over what time window were both measured?
- Where did the eval set come from, how many cases, what were the pass conditions, and is the suite still running in CI?
- What models were in production at the end of the engagement, named with version, and what was the cost-per-request profile?
- When did the system last get a production code change, and what was that change?
- Walk me through the worst regression on this system in the last six months, and where the post-mortem lives.
- What percentage of the production codebase did your team author, by line count, and what is the cost-per-request and cost-per-month trajectory over the last six months?
The pattern of responses is more diagnostic than any individual answer. An agency that hits all seven with named artifacts is operating at the manifesto standard. An agency that hits two or three is mid-transition, and worth a paid evaluation if the partial answers are credible. An agency that hits zero or one is selling a 2024 service in 2026 packaging, regardless of how impressive the case study page looked. The seven questions cost nothing to send, take an hour to evaluate, and save the buyer a quarter of wasted engagement budget. They are the cheapest filter in the AI procurement process.
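The scoring rules above can be captured in a short rubric. The sketch below encodes only the bands this article states; the label for responses arriving in six to ten business days and the handling of four to six hits are assumptions, since the article does not classify them.

```python
def speed_band(business_days: int) -> str:
    """Classify response time: within five business days is fast, beyond ten is concerning."""
    if business_days <= 5:
        return "fast"
    if business_days > 10:
        return "concerning"
    return "neutral"  # assumption: the article does not label six to ten days

def verdict(hits: int) -> str:
    """Map the number of questions answered with named artifacts to a verdict."""
    if not 0 <= hits <= 7:
        raise ValueError("hits must be between 0 and 7")
    if hits == 7:
        return "manifesto standard"
    if hits in (2, 3):
        return "mid-transition"
    if hits <= 1:
        return "2024 service in 2026 packaging"
    return "unclassified"  # assumption: four to six hits is a judgment call

print(speed_band(3), verdict(7))
print(speed_band(12), verdict(1))
```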
Frequently Asked Questions
How do I tell if an AI agency case study is real or marketing?
Decode eight elements: the client logo (real-and-permitted vs licensed-for-general-marketing), the headline metric (named baseline, time window, and measurement vs handwaved percentage), eval methodology (specific case count and pass conditions vs ‘rigorously tested’), model and version named, the engagement time period with at least three dates, regression handling with at least one frank failure story, post-launch operational handoff, and the agency-vs-client contribution split. Each element has a real version that names artifacts and a marketing version that elides them. Most case studies mix the two, and the decoding job is to separate them.
What does a credible AI case study metric look like?
It names what was measured, against what baseline, and over what period, with the measurement infrastructure implied. For example: ‘Median time-to-first-token on the customer support inbox flow dropped from 4.2s to 2.3s after the routing layer was migrated to LiteLLM with provider-aware fallback, measured over the 60 days following deploy across 3.4M requests.’ A claim like ‘45% faster customer support’ is not credible because it does not name the baseline, the user population, the measurement method, or the sustainment window. The unfilled gaps are where the marketing distortion lives.
Why does naming the model and version matter in an AI case study?
Because model behavior changes between versions in ways that invalidate prior evals. An agency that names ‘Claude Opus 4.7’ is signaling that they ran the evals on that specific version and would re-run them if the version changed. An agency that says ‘a state-of-the-art large language model’ is signaling that the engagement is not version-aware, which means the evals were a one-time exercise rather than a CI discipline. The corollary is that the production stack (embedding model, vector store, re-ranker, inference provider) should also be named precisely if the agency operationalized the system.
What is the most diagnostic question to ask about an AI case study?
‘Walk me through the worst regression on this system in the last six months, and where the post-mortem lives.’ Every production AI system regresses: model deprecations, retrieval drift, cost spikes, JSON-mode breakage on a silent provider update. Agencies that operate systems have regression stories ready, with named root causes, eval-detected timelines, and post-mortems committed to the client’s repo. Agencies that pivot to a different story, or claim the system has been smooth since deploy, are signaling that they shipped and walked away. The regression-handling question is the cheapest filter for operational seriousness.
Should I dismiss an AI agency case study with no dates?
Treat it as automatically stale. AI tooling moves quarterly, and a system architected in 2024 would be wrong in 2026 in specific ways: no Model Context Protocol, single-provider lock-in, no cost-per-call observability, evals run once at deploy. The credible case study names three dates: engagement start, system shipped, last verified in production. The third date is the most diagnostic, because production-in-2025 systems that have not been touched in 2026 are typically broken from at least three model deprecations and two API changes in the intervening months. Undated case studies are sales artifacts, not technical evidence.
How important is the agency vs client team contribution split in a case study?
Critical, because almost every AI engagement is a collaboration. The agency typically writes the routing layer, observability, and eval scaffolding; the client team typically writes the integration with internal data systems and the human-in-the-loop interface; the client domain expert typically authors most of the eval cases. A case study that uses ‘we’ throughout and never distinguishes is hiding the dependency on the client team. The risk to the buyer is hiring an agency that needs a strong client team, then discovering their own team is not strong enough. The contribution question (‘what percentage of the production codebase did your team author, by line count?’) surfaces the dependency.
What does post-launch outcome reveal about an AI agency?
Whether they operationalized the system or shipped it and walked away. The credible post-launch section names the handoff: who owns the system now, what runbooks exist, what the on-call rotation is, and what the cost trajectory has been over the last six months. The marketing version stops at deploy and uses ‘ongoing partnership’ as a euphemism for ‘we are still billing.’ The diagnostic question is ‘what is the cost-per-request and cost-per-month trajectory over the last six months?’ The number itself matters less than whether the agency has it; agencies that operationalized systems track cost telemetry, and agencies that did not are bluffing.
What are the seven questions to send an AI agency before a paid evaluation?
May I speak to a named contributor at the client side on a 30-minute call this week? What was the baseline metric, post-deploy metric, and time window? Where did the eval set come from, how many cases, what pass conditions, and is the suite still running in CI? What models were in production at the end of the engagement, named with version? When did the system last get a production code change? Walk me through the worst regression in the last six months and where the post-mortem lives. What percentage of the production codebase did your team author by line count, and what is the cost-per-request and cost-per-month trajectory? Score responses on speed (under five business days is fast) and specificity (named artifacts and numbers beat narrative).
Why do most AI agency case studies fail 15 minutes of structured questioning?
Because they are written by marketing teams who interviewed an account director, optimized for a hero metric, and signed off by a partner-level executive who rarely spoke to the engineer who shipped the work. The case study reflects the agency’s writing discipline: agencies that publish vague case studies write vague internal documentation, and the engagement will be run with the same vagueness. Conversely, agencies whose case studies have named eval methodology, frank regression handling, and explicit time bounds are signaling that their internal documentation is similarly disciplined. The case study is a sample, not the deliverable.
Is regression silence in a case study a good sign?
No, it is the opposite. Regression silence at the 12-month horizon is implausible and at the 24-month horizon is impossible; every production AI system has model deprecations, JSON-mode breakage, retrieval drift, or cost spikes in that window. A case study that does not include at least one frank regression story is signaling that the agency does not write post-mortems, which means failures get handled verbally rather than turned into eval cases and runbook entries. The buyer who reads regression silence as good news is misreading the document; regression silence is the marker of an agency that bills past failures rather than learning from them.
Arthur Wandzel