A reference call is the only point in an AI agency procurement where you talk to someone whose interests are not aligned with closing your deal. Use it correctly and you can verify in thirty minutes what six weeks of vendor calls and decks could not. Use it badly; “How was working with them? Would you hire them again?”; and you will get the polite, useless half-endorsement most reference call produces by default.
The questions below are the ones I ask when a portfolio company hands me a reference list before signing an AI agency engagement. They surface what shipped, what broke, what the bill turned out to be, and whether the references are coached. None of them rely on the reference being indiscreet; many of them work even when the reference genuinely liked the agency. They are the operational complement to the AI agency manifesto; the manifesto names what an agency owes you; this names what to ask the people they have already owed it to.
Block thirty minutes per reference, take notes verbatim, run two minimum, three for any engagement above six figures.
Decision Scope
This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.
Before the call: set the frame
Open with the same thirty-second framing most time:
“I am evaluating Agency X for a similar engagement. I do not need a glowing review and I will not repeat anything you say back to them. What I need is the operational truth; what shipped, what broke, what the bill was, and what you would do differently. Where you tell me they were great, that is signal. Where you hesitate, that is also signal.”
That paragraph improves answer quality more than any question on this list. References want to be honest; they just need permission and a frame that is not “would you recommend this vendor.” Then ask the eleven questions in order; artifacts first, operational behavior next, commercial and counterfactual last.
The eleven questions
1. What did the agency ship, and is it still running?
Why ask: A reference who cannot crisply describe what shipped is referencing a project that did not ship. Past tense, production state, and “still running” many matter.
Vague: “They helped us with our AI strategy and built some prototypes.” (No artifact, no production state, no verb that ends in “ship.”)
Sharp: “They shipped a customer-support triage classifier into our Zendesk instance in March 2025. Still running, routes about 40,000 tickets a week, our team has extended it twice since they left.”
What it reveals: Whether the agency ships running software or demos that died after the SOW closed. Production agencies generate references whose systems are still running a year later. The difference is the entire procurement decision.
2. Who from the agency was on the project, and where are they now?
Why ask: The 2023 consulting playbook puts senior people on the pitch and junior people on the work. Asking who was on the project surfaces the bait-and-switch that SOWs hide.
Vague: “Their team was great.” (No names, no roles, no durations.)
Sharp: “Two engineers full-time for ten weeks; Sarah was lead, Marcus built the eval suite. Sarah is still at the agency, Marcus is at Anthropic now. Their CTO joined two design reviews; the founder we met in sales did not show up after the SOW closed.”
What it reveals: Whether pitched people are delivered people, whether the agency retains engineers, and whether senior names are decorative or load-bearing.
3. Show me the eval suite they delivered. Or describe it to me.
Why ask: The strongest single signal in agency vetting. A reference describing the same eval suite from memory either confirms it or, more revealingly, fails to.
Vague: “Yeah, they tested everything thoroughly before launch.” (Testing is not evals.)
Sharp: “They built us a Promptfoo suite with about 180 cases; regulatory edge cases, common customer phrasings, adversarial prompts from our security team. Threshold was 92% pass on the regulatory set, negotiated in week two. CI runs it on most PR; we have caught two model-drift regressions since they left.”
What it reveals: Whether eval-driven development is the agency’s practice or a sales slide. We unpack the discipline in evaluating LLM development companies. Operational eval detail describes an agency that treats AI model evaluation testing services as core, not upsell.
4. What broke in production after they shipped, and how did they handle it?
Why ask: Most shipped AI system has a production failure within ninety days. Either the reference can describe the failure and response, or the system did not ship, or the agency does not handle production. No fourth option.
Vague: “Honestly, nothing major. It went smoothly.” (Either nothing shipped or the reference is being polite.)
Sharp: “Week three post-launch, the model hallucinated customer policy numbers when retrieval returned no matches. Sarah pushed a hotfix returning ‘no policy match’ instead of fabricating, then shipped a structural fix with a new eval case the next Tuesday. Postmortem in our shared Notion.”
What it reveals: Whether the agency operates what it ships or hands it over a wall. A specific failure plus remediation plus postmortem is the strongest possible signal.
5. What did they push back on you about?
Why ask: Agencies that rarely push back are billing time, not shipping outcomes. Senior engineers push back. Reference-call language for “did they have professional spine” is “what did they push back on.”
Vague: “They were easy to work with; rarely any friction.” (Translation: they were billing hours.)
Sharp: “They refused to ship our first proposed feature without a human-in-the-loop step because the false-positive rate was too high. They were right; we confirmed in the eval suite. They also pushed back on timeline twice and renegotiated scope the second time.”
What it reveals: Whether the agency exercises judgment or rents out hands. Buyers think they want a vendor that says yes; what they want is one that knows when to say no.
6. What was the actual total cost, including overruns and inference?
Why ask: Vendor-quoted and reference-reported costs diverge in operationally important ways. SOW versus invoice versus inference bill versus internal time tells you whether the engagement was honestly scoped.
Vague: “It was within budget, I think; I wasn’t the budget owner.”
Sharp: “SOW was $180k over twelve weeks. Came in at $194k because we added a fourteen-day extension for a security-review fix they recommended. Inference is on our Anthropic account, ~$3,200 a month. Internally, I spent maybe six hours a week embedded with their team.”
What it reveals: Whether the agency holds the inference bill (token-arbitrage red flag; see the manifesto’s no-token-arbitrage commitment), whether overruns were transparent, and whether the agency was honest about client-time burden.
7. How embedded were they in your tooling and team?
Why ask: A reference who describes the agency in their Slack, GitHub, and on-call rotation is describing forward-deployed engineering. A reference who describes weekly status meetings and a vendor sandbox is describing 2023-archetype consulting.
Vague: ” collaborative, lots of meetings, great communication.”
Sharp: “They were in our Slack from week one; #proj-triage, tagging our infra team directly. Committed to our repo under their own GitHub identities. Their lead was on our PagerDuty rotation for thirty days post-launch as part of handover.”
What it reveals: Whether the agency is operationally embedded or running calls from outside. Specifics like Slack channel name and on-call rotation are details a reference cannot manufacture without living through them.
8. What surprised you; good or bad; once they were in?
Why ask: Disarms the rehearsed reference. Coached references answer obvious questions sharply; they rarely have a prepared answer for “what surprised you.” Surprises are where the engagement deviated from the pitch.
Vague: “Nothing comes to mind, it went how we expected.”
Sharp (positive): “How fast they shipped the first eval suite; expected three weeks, had it on day five. Changed the cadence of our review cycles.”
Sharp (negative): “We did not realize until week six that their engineer was also on another project; twenty percent allocated. They were transparent when asked, work was still on track, but we should have caught that at SOW signing.”
What it reveals: Engagement texture that does not show up in case studies. Absence of any surprise usually means the reference is glossing.
9. How did the engagement end, and who owns it now?
Why ask: A clean handover is a positive signal; an ambiguous one often means the buyer cannot operate the system without the agency. Sometimes that is healthy retainer, sometimes it is a hostage situation.
Vague: “We are still working with them, ongoing relationship.”
Sharp: “Engagement ended on schedule in November. They wrote a handover doc, sat in on our first two on-call rotations, and we have an extension for one day a month on breaking-change reviews when models update. Our team owns it day-to-day.”
What it reveals: Whether the agency builds buyer capability or buyer dependency; the manifesto’s “we will help you replace us” commitment, made operational.
10. Did they tell you what not to do with AI?
Why ask: The honest-counterparty filter. Agencies paid to say yes will say yes. Agencies that have told clients no exercise judgment about where the technology pays off.
Vague: “They were excited about many our use cases.”
Sharp: “They told us not to use a model for pricing-quote generation because the legal review burden would exceed the productivity gain, and proposed a deterministic rules engine instead. Lost about $40k of scope by saying that. They were right.”
What it reveals: Whether the agency’s incentive structure tolerates losing scope to do right by the client. Often the most diagnostic answer on the call. One example of an agency talking a client out of a use case is the strongest single trust signal you can collect.
11. What would you do differently if you ran this engagement again?
Why ask: The closer. Reference-call hindsight is operationally specific in a way prospect-call foresight cannot be. The answer often contains the question you should have asked but did not.
Vague: “Honestly, not much, it went well.”
Sharp: “Two things. I would have written our internal eval framework into the SOW as a deliverable, not an emergent thing; we got one anyway but it would have saved arguments. And I would have included security from week one instead of week six. The second was our mistake, but a more experienced agency might have flagged it.”
What it reveals: Where structural friction was and which side caused it. Sharp answers surface concrete clauses to add to your own SOW. Treat this as procurement intelligence, not closing pleasantry.
Reading the silence and scoring the call
The questions surface signal; the gaps in the answers surface more. If a reference cannot describe the eval suite from memory but the agency claimed one in the pitch, the eval suite was a slide. If the reference describes only sales-stage interactions and no delivery-stage details, the senior people were decorative. References lie by omission, not by commission; almost no one says “the agency was bad,” but many fail to say “the agency was great,” and the failure is the answer.
Score each call against eleven binary signals; one per question. Eight yes is strong. Five is marginal. Three means you are not talking to a reference; you are talking to a friendly person whose project did not ship the way the agency described. Triangulate across two or three references and treat divergences as diagnostic.
Then go back to the agency with three or four pointed questions surfaced from the references. Their answers; and the speed at which they produce them; are the final filter before signing. The structural agenda for that vendor call is in the field guide to evaluating an AI agency in under 90 minutes; the reference call is what makes those 90 minutes accurate instead of theatrical.
Frequently Asked Questions
How many references should I ask for from an AI agency?
Three is the right number for any engagement above six figures. Two is the floor. If the agency cannot produce three references whose work shipped in the last twelve months, that is your answer before you make a single call.
Should I let the agency pick the references?
Yes; initially. The references they offer are the ones they consider their best work, which is itself signal. Then ask, after the first reference, “Who else worked with them in the last year that I could talk to?” The agency list and the reference-suggested list rarely overlap completely, and the delta is informative.
What if the reference is under NDA and cannot share specifics?
Common, and not a blocker. NDAs constrain what the reference can name, not whether eval suites existed, whether engineers shipped to production, whether failures were postmortemed. Re-frame questions away from the named system and toward the operational behavior; “without naming the project, did the agency deliver an eval suite as a checked-in artifact?”
How long should each reference call take?
Thirty minutes if the reference is sharp, forty-five if signal is strong. Below twenty minutes, the reference did not engage deeply. Above sixty, you have stopped vetting and started friend-making.
What if most reference says the agency was great?
That is the default outcome of a poorly run reference call, not evidence of a great agency. If you are getting universal positive answers, your questions are too soft. Re-run the next call with sharper versions of question 10 (“did they tell you what not to do”) and question 5 (“what did they push back on”).
Is it appropriate to ask about the contract or pricing?
Yes; references typically share approximate dollar figures and overrun ranges when framed as procurement diligence. They will not share their specific MSA. The useful number is the SOW-versus-final delta and whether inference was billed transparently.
Should I do reference calls before or after the technical evaluation?
Before. The technical evaluation is more accurate when you already know what to probe. If a reference mentions a specific eval tool or a specific failure mode, you can ask the agency about it on the technical call and watch whether the answers match.
What is the single most diagnostic question on this list?
Question 10; “did they tell you what not to do with AI.” A reference with no example is describing a vendor that will not exercise judgment on your roadmap either.
Closing
Reference calls are operational intelligence, not procurement theatre. Two references, eleven questions, twenty-two data points; enough to know whether the agency ships software or sells decks, whether the bill will be honest, and whether the people pitched are the people delivered. Anything after that is execution; the decision is already in the answers.; Arthur Wandzel
Arthur Wandzel