A field guide to evaluating an AI agency in under 90 minutes

If you cannot decide on an AI agency in 90 minutes, you are not asking the right questions. Vetting calls bloat into multi-week procurement theatre because buyers have no falsifiable rubric, so they keep gathering soft signals until exhaustion forces a coin flip. The shortcut is a structured 90-minute agenda that forces the agency to show evidence: artifacts, eval suites, architecture diagrams, post-mortems, and a transparent line-item commercial proposal. If they can produce all five, you have your partner. If they cannot, you have your answer just as quickly.

This is the agenda I run when a portfolio company asks me to sit in on an agency evaluation. It is opinionated, time-boxed, and it embarrasses 2023-archetype consulting firms within the first ten minutes. It rewards the kind of forward-deployed engineering team described in the AI agency manifesto, the only kind worth hiring in 2026.

Block 90 minutes. Tell the agency in advance that you will be looking at concrete artifacts from a recently shipped project, that the call is technical, and that you expect engineers, not just account executives, on the call. Any agency that cannot meet that bar in the prep email is already a no.

Decision Scope

This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.

0–10 minutes: the artifacts walkthrough

Start with the part of the call that ends 30 percent of evaluations within ten minutes: ask to see real artifacts from a recently shipped project. Not a deck. Not a screenshot of a demo. Pull-request links, repo structure, the actual eval suite, the actual deployed system.

1. Screen-share a pull request from a project shipped in the last 60 days.

  • Green: they navigate to a PR, walk through the diff, point at the eval changes that gated the merge.
  • Yellow: redacted PR description, no actual code, citing NDA; acceptable if they offer a synthetic example.
  • Red: they pivot to a deck or “case studies” instead of code.

2. Show me the README of that repo’s evals/ directory.

  • Green: it exists. It explains the eval format, the threshold, who set it, why, and how new evals get added.
  • Yellow: evals exist but live in a notebook; threshold is verbal.
  • Red: no evals/. They use “evals” to mean “we tested it manually.”

3. Show me the deploy pipeline.

  • Green: CI runs the eval suite on every PR, blocks merges below threshold, posts the eval delta as a comment.
  • Yellow: evals run nightly, not per-PR, but failures page someone.
  • Red: “We deploy from a developer’s laptop,” or evals are not part of the gate.

Three reds in this block and you can end the call at minute 10. They do not ship production AI software. They demo it.
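To make the green answer for the deploy pipeline concrete, here is a minimal sketch of the gate logic a CI job might run after the eval suite finishes. The function name, the counts, and the 92% threshold are hypothetical illustrations, not any agency's actual pipeline; the only point is that the threshold is a number in code, the merge is blocked below it, and the delta against the main branch is something a bot can post on the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    pass_rate: float  # fraction of eval cases passed on this PR
    delta: float      # change versus the main-branch baseline
    blocked: bool     # True when the merge should be blocked
    comment: str      # text a CI job could post as a PR comment


def evaluate_gate(passed: int, total: int, baseline_rate: float, threshold: float) -> GateResult:
    """Compare this PR's eval pass rate to a fixed threshold and to the baseline."""
    if total <= 0:
        raise ValueError("eval suite is empty; refusing to gate on zero cases")
    pass_rate = passed / total
    delta = pass_rate - baseline_rate
    blocked = pass_rate < threshold
    verdict = "BLOCKED" if blocked else "OK"
    comment = (
        f"Eval gate {verdict}: {pass_rate:.1%} "
        f"(threshold {threshold:.1%}, delta {delta:+.1%} vs main)"
    )
    return GateResult(pass_rate, delta, blocked, comment)


# Hypothetical numbers: 178 of 200 cases pass, main sits at 93%, gate is 92%.
result = evaluate_gate(passed=178, total=200, baseline_rate=0.93, threshold=0.92)
print(result.comment)
```

In a real pipeline the CI step would exit non-zero when `result.blocked` is true, which is what actually stops the merge. When the agency screen-shares its pipeline, look for the equivalent of these three things: the threshold, the block, and the posted delta.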

10–30 minutes: the eval discipline check

Eval discipline is the single sharpest competence signal in 2026. The practice is well-defined, the tooling is mature (Promptfoo, LangSmith, Braintrust, Anthropic’s eval tooling, OpenAI’s evals API), and agencies that take evals seriously sound recognizably different from those that treat them as a marketing word. Spend 20 minutes here.

1. Walk me through the eval threshold you set on the most recent project. Who chose the number? Why?

  • Green: a specific number (e.g., “92% on the legal-clause classifier suite”), set with the client’s domain expert, with a written rationale tying it to a business outcome.
  • Yellow: a number, but vague rationale (“we picked 90 because it felt right”).
  • Red: no number, or a number disconnected from any client-specific outcome.

2. Tell me about a time the eval suite caught a production regression. What broke? What was the fix?

  • Green: a specific story with model version, regression type, detection time, and remediation, told from memory.
  • Yellow: a generic story (“we caught one, I would have to look up details”).
  • Red: “We have not had a regression.” Either they have not shipped enough, or they are not detecting them.

3. How do you handle eval drift when a model is silently updated?

  • Green: pinned model versions in production, evals run against new versions in staging before promotion, written promotion checklist.
  • Yellow: versions pinned but upgrade process is ad hoc.
  • Red: “We use the latest model.” No version pinning. A 2024 anti-pattern.

4. Show me the eval suite. I want to read 5 of the test cases.

  • Green: specific, varied, include adversarial inputs and edge cases, clear pass/fail criteria; several derived from real production failures.
  • Yellow: generic happy-path test cases.
  • Red: LLM-as-judge with no ground truth, or none they can show.

If they pass this block, you have probably found a real engineering team.
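The green answer to the drift question (pinned versions, staging evals before promotion) can be sketched as a small promotion check. This is an illustration under stated assumptions: the version strings, the threshold, and the `max_regression` tolerance are hypothetical, and a real promotion checklist would include more than two numeric conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    model_version: str  # exact pinned identifier, never an alias like "latest"
    pass_rate: float    # pass rate of the full eval suite for this version


def can_promote(
    pinned: EvalRun,
    candidate: EvalRun,
    threshold: float,
    max_regression: float = 0.0,
) -> tuple[bool, str]:
    """Decide whether a candidate model version may replace the pinned one."""
    if candidate.model_version == pinned.model_version:
        return False, "candidate is already the pinned version"
    if candidate.pass_rate < threshold:
        return False, (
            f"candidate {candidate.pass_rate:.1%} is below threshold {threshold:.1%}"
        )
    if pinned.pass_rate - candidate.pass_rate > max_regression:
        return False, "candidate regresses against the pinned version"
    return True, "eligible for promotion once the written checklist is signed off"


# Hypothetical version identifiers and scores from a staging run.
pinned = EvalRun("example-model-2025-11-01", 0.94)
candidate = EvalRun("example-model-2026-02-01", 0.95)
print(can_promote(pinned, candidate, threshold=0.92))
```

The design choice worth probing on the call is the second check: a candidate can clear the absolute threshold and still be worse than what is in production, and a team with real eval discipline will have an opinion about how much regression, if any, it tolerates.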

30–55 minutes: the architecture deep-dive

Twenty-five minutes to test whether the agency designs systems or wires up demos. Production AI failure modes (context-window exhaustion, retrieval drift, tool-call loops, cost runaway, provider-degradation latency spikes) are architecture problems, not prompt problems. An agency that has only ever shipped a chat wrapper will not see them coming. Use a whiteboard or shared diagram.

1. Sketch the architecture of your most recent shipped system.

  • Green: they draw it in real time, label the model-router boundary, show where retrieval lives, show where tool calls execute, and identify failure-mode boundaries.
  • Yellow: pre-existing diagram, but they answer follow-ups crisply.
  • Red: prose only, or a sketch with one box labeled “LLM” connected to “Database.”

2. Where does this system call a model directly vs. through an abstraction?

  • Green: every model call goes through a router (LiteLLM, custom, or framework-native), enabling swaps, A/B tests, and provider-outage fallback. They can name the last time the abstraction earned its keep.
  • Yellow: thin abstraction, bypassed in places.
  • Red: direct SDK calls everywhere. “We are an Anthropic shop.” A forced rewrite waiting to happen.

3. How does this system handle a 30-second outage from your primary provider?

  • Green: documented fallback to a secondary provider, a local open-weights model, or a graceful “AI features temporarily unavailable” degradation.
  • Yellow: “We would notice and switch manually.”
  • Red: “That has not happened to us.” Anthropic, OpenAI, and Google all had multi-hour incidents in the last 12 months.

4. How is context assembled for a typical request? Walk through a real one.

  • Green: clear separation between versioned system prompt, retrieval (with chunking and freshness policy), typed tool definitions, and sanitized user input; observable in their tracing.
  • Yellow: most of this exists but tracing is rudimentary.
  • Red: “We just send everything to the model.” Architecture of a project that will hit a $40K monthly inference bill and not know why.

5. What changed in your architecture between this project and the one before it?

  • Green: a specific evolution. They generalized a lesson from project N-1 into project N and can articulate the trade-off.
  • Yellow: vague (“we have gotten better at prompts”).
  • Red: nothing changed, or every project starts from a different stack.
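For the router and outage questions above, this is roughly the shape I expect the abstraction to have. It is a deliberately tiny, hand-rolled sketch rather than LiteLLM's or any framework's API: `Provider` is a hypothetical callable that takes a prompt and returns text, and the provider names are placeholders.

```python
from typing import Callable, Optional, Sequence

# Hypothetical provider interface: takes a prompt, returns the model's text.
Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider fails and no degraded response is configured."""


def route(
    prompt: str,
    providers: Sequence[tuple[str, Provider]],
    degraded_message: Optional[str] = None,
) -> tuple[str, str]:
    """Try providers in priority order; fall back, then degrade gracefully."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # in production, catch provider-specific errors
            errors.append(f"{name}: {exc}")
    if degraded_message is not None:
        return "degraded", degraded_message
    raise AllProvidersFailed("; ".join(errors))


def primary(prompt: str) -> str:
    raise TimeoutError("simulated provider outage")


def secondary(prompt: str) -> str:
    return f"answer to: {prompt}"


served_by, text = route("summarize clause 4", [("primary", primary), ("secondary", secondary)])
print(served_by, text)
```

Because every call site goes through `route`, swapping a provider or adding a fallback is a one-line change to the provider list. That is the property to listen for when the agency describes its own abstraction, whatever library it is built on.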

A team that has shipped five production AI systems sounds completely different from a team that has shipped 50 prototypes.
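The context-assembly question in this block is easier to judge if you have a picture of what the green answer looks like in code. The sketch below is an assumption-laden illustration: the field names, the 30-day freshness window, and the placeholder sanitizer are all hypothetical, and real sanitization is application-specific. What matters is that each ingredient is separate and that the assembly emits a trace you could inspect later.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Chunk:
    source: str
    text: str
    fetched_at: datetime  # timezone-aware time the chunk was indexed


def sanitize(user_input: str, max_chars: int = 2000) -> str:
    """Placeholder sanitizer: drop non-printable characters and cap the length."""
    cleaned = "".join(ch for ch in user_input if ch.isprintable() or ch in "\n\t")
    return cleaned[:max_chars]


def assemble_context(
    prompt_version: str,
    system_prompt: str,
    chunks: list[Chunk],
    tool_names: list[str],
    user_input: str,
    now: datetime,
    max_age: timedelta = timedelta(days=30),  # hypothetical freshness policy
) -> tuple[dict, dict]:
    """Build the request context and a trace record describing what went in."""
    fresh = [c for c in chunks if now - c.fetched_at <= max_age]
    context = {
        "system": system_prompt,
        "retrieved": [c.text for c in fresh],
        "tools": list(tool_names),
        "user": sanitize(user_input),
    }
    trace = {
        "prompt_version": prompt_version,
        "chunks_used": [c.source for c in fresh],
        "chunks_dropped_stale": len(chunks) - len(fresh),
        "approx_chars": len(system_prompt)
        + sum(len(c.text) for c in fresh)
        + len(context["user"]),
    }
    return context, trace


now = datetime(2026, 5, 20, tzinfo=timezone.utc)
chunks = [
    Chunk("policy.md", "Refunds within 14 days.", now - timedelta(days=10)),
    Chunk("old-faq.md", "Refunds within 7 days.", now - timedelta(days=90)),
]
context, trace = assemble_context("v12", "You are a support agent.", chunks, ["lookup_order"], "hi", now)
print(trace)
```

A team that “just sends everything to the model” has no equivalent of the `trace` dictionary, which is why it cannot explain its own inference bill.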

55–75 minutes: the post-mortem reading

This is the block most evaluations skip, and the most diagnostic 20 minutes you will spend. Ask the agency to walk through a real post-mortem: written, dated, with a remediation plan and follow-up status. Post-mortems are a forcing function for engineering culture: agencies that operate production systems write them; agencies that ship demos do not. Use a real document on screen.

1. Show me a post-mortem from the last 90 days. Read me the first paragraph.

  • Green: they have one. It is dated, names a real incident, identifies a root cause specifically, and ends with action items that have owners and deadlines. It is not blame-coded.
  • Yellow: they have something post-mortem-shaped (a Notion page or a Slack thread), but it is unstructured and there are no owners on action items.
  • Red: “We do retrospectives in standup.” There is no written artifact. This means there is no learning loop and the same failure modes will recur on your project.

2. What was the most expensive bug you shipped in the last year and how was it caught?

  • Green: a specific story with a dollar number attached (token bill, downtime, refund), a detection mechanism (eval suite, customer report, internal metric), and a structural change made afterward.
  • Yellow: a story without a dollar number, or where the detection was “the customer told us” without a follow-up about why no internal monitoring caught it.
  • Red: “We have not shipped a serious bug.” Either they have not shipped enough, or they are not measuring.

3. Tell me about a time you said no to a client request. Why?

  • Green: a specific case. The client asked for X, the agency declined or counter-proposed because X would have failed evals, blown a cost ceiling, or violated a security boundary. The agency held the line and the project was better for it.
  • Yellow: they pushed back but eventually relented; the project shipped but with a known wart.
  • Red: “We do whatever the client wants.” This is a yes-shop. They will let you ship a bad system because you asked them to.

If you want a structured way to validate post-mortem claims, the check AI developer references guide is a useful follow-up; it gives you the exact questions to ask the previous client about how the post-mortem turned out in practice.

75–90 minutes: the commercial alignment

You have spent 75 minutes on engineering signal. The last 15 minutes test whether the commercial structure is aligned with your interests or extractive against them.

1. Who holds the model API keys on this engagement?

  • Green: “You do. Your Anthropic, OpenAI, Google accounts. Your bill. Our job is to make that bill predictable and small.” They will recommend cost monitoring tools and set a budget alert with you.
  • Yellow: they hold the keys during development and transfer them at handoff. Acceptable if the transfer process is documented.
  • Red: they hold the keys forever and bill you a flat fee that “includes inference.” This is token arbitrage. Your bill is opaque, their margin grows when costs fall, and you cannot see your own usage.

2. Show me a sample line-item invoice from a similar engagement.

  • Green: line items broken out by engineer, day count, and rate. Tools and infrastructure passed through at cost. Inference passed through at cost or billed to your account directly. Any markup is disclosed.
  • Yellow: a milestone-based invoice that maps to deliverables. Acceptable if the deliverables are concrete.
  • Red: a single line item (“Q2 AI development services - $180,000”). No breakdown. No transparency.

3. What happens to the code, weights, prompts, and evals at the end of the engagement?

  • Green: “Everything is yours. Your repo, your models, your prompts, your evals. We have no claim. We can sign an IP assignment day one.”
  • Yellow: most assets transfer but they retain a “framework” or “platform” they want to keep. Acceptable if the framework is genuinely portable and well-documented.
  • Red: they own the prompts. They own the evals. They own the model fine-tunes. You are renting your own AI system from them indefinitely.

4. What does the off-ramp look like? If we want to bring this in-house in six months, what do you do?

  • Green: a defined transition process. Documentation, knowledge-transfer sessions, a hiring pipeline, and a discount on remaining months if you ramp down. They have done this before.
  • Yellow: they will help but the process is ad hoc.
  • Red: there is no off-ramp. Lock-in is the business model. Walk away.

If the commercial answers contradict the engineering answers (strong evals but opaque billing, sharp architecture but no IP transfer), you have learned something important. The agency knows how to build but has chosen a business model designed to extract. That is a partner you can use only with eyes open and a tight contract; consider walking unless the engineering signal is exceptional. For drafting that contract, the AI agency vetting checklist for CTOs covers the line-item language to insist on.
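The token-arbitrage point from the API-key question is easiest to see with arithmetic. The numbers below are purely hypothetical planning figures, not market prices: assume a flat fee in which some amount is implicitly allocated to inference, and watch what happens to the agency's hidden margin when the per-token price drops.

```python
def hidden_inference_margin(
    bundled_inference_fee: float,
    tokens_millions: float,
    price_per_million: float,
) -> float:
    """Margin the agency keeps on inference when it is bundled into a flat fee."""
    actual_cost = tokens_millions * price_per_million
    return bundled_inference_fee - actual_cost


# Hypothetical month: $10,000 of the flat fee notionally covers inference,
# and the system consumes 500 million tokens.
before = hidden_inference_margin(10_000.0, tokens_millions=500, price_per_million=10.0)
after = hidden_inference_margin(10_000.0, tokens_millions=500, price_per_million=6.0)
print(before, after)  # 5000.0 7000.0
```

Under direct billing the same price drop lowers your bill instead of raising their margin, which is the whole argument for holding the keys yourself.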

What the 90 minutes decides

At minute 90 you have a decision, not a feeling. You have seen a real PR, read a real eval suite, looked at a real architecture diagram, read a real post-mortem, and seen a real invoice. Either the artifacts hold up under follow-up questions or they collapse.

The trap most CTOs fall into is mistaking the absence of red flags for the presence of green ones. An agency that did not embarrass itself is not the same as an agency that earned the work. Count the green answers, not the absence of reds. A 12-of-20 agency is a different bet than a 19-of-20 agency, and the cost of getting that wrong is six months of your roadmap.
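If you want the count on paper rather than in your head, a scorecard like the following is enough. It is a sketch with hypothetical block names; it tallies greens across whatever questions you actually asked and applies the one hard rule from the artifacts block (three reds there ends the call).

```python
from collections import Counter
from enum import Enum


class Rating(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


def tally(ratings_by_block: dict[str, list[Rating]], artifacts_block: str = "artifacts") -> dict:
    """Count green, yellow, and red answers and flag the minute-10 exit rule."""
    counts = Counter(r for ratings in ratings_by_block.values() for r in ratings)
    artifacts_reds = ratings_by_block.get(artifacts_block, []).count(Rating.RED)
    return {
        "greens": counts[Rating.GREEN],
        "yellows": counts[Rating.YELLOW],
        "reds": counts[Rating.RED],
        "total": sum(counts.values()),
        "end_call_at_minute_10": artifacts_reds >= 3,
    }


G, Y, R = Rating.GREEN, Rating.YELLOW, Rating.RED
scorecard = tally({
    "artifacts": [G, G, Y],
    "evals": [G, Y, G, G],
    "architecture": [G, G, Y, G, G],
    "post_mortem": [G, R, G],
    "commercial": [G, G, Y, G],
})
print(f"{scorecard['greens']} green of {scorecard['total']}")
```

The output is a green count over the questions asked, which is the number the paragraph above tells you to compare between agencies.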

Treat the 90-minute call as the evaluation that earns you the right to reference checks, contract negotiation, and a paid pilot, not as the whole thing. A two-week paid pilot with fixed scope, real production code, and an eval suite as a required deliverable is the cheapest insurance you can buy. If the agency declines a paid pilot in favor of “going straight to the engagement,” that itself is a signal.

The reason most evaluations take six weeks is that nobody runs them as evaluations; they run them as relationship-building exercises. Relationships are downstream of trust, and trust is downstream of evidence. Get the evidence in 90 minutes. The relationship can build itself on the actual project, where it belongs.


Arthur Wandzel is the founder of SFAI Labs, a forward-deployed AI development agency in San Francisco. He has sat in on more than 60 AI agency evaluations on behalf of portfolio companies and clients in the last 18 months.

Frequently Asked Questions

Can you evaluate an AI agency in 90 minutes?

Yes, if the 90 minutes is structured around evidence rather than relationship-building. Five segments (artifacts walkthrough, eval discipline, architecture deep-dive, post-mortem reading, and commercial alignment) surface enough signal to make a decision. The reason most evaluations take six weeks is that nobody runs them as evaluations; they run them as vibe checks. A structured agenda with green/yellow/red answers compresses six weeks of soft signals into 90 minutes of hard ones.

What artifacts should I ask an AI agency to show in a vetting call?

Five concrete artifacts: a recent pull request from a shipped project, the README of that repo’s evals/ directory, the deploy pipeline configuration showing eval gates, an architecture diagram drawn in real time, and a written post-mortem from the last 90 days. Plus a sample line-item invoice from a similar engagement. If the agency cannot produce all six under 90 minutes of time pressure, they either do not have them or do not want you to see them.

What is the single sharpest competence signal when vetting an AI agency in 2026?

Eval discipline. Ask the agency to walk through the eval threshold they set on their most recent project: who chose the number, why that number, and how it ties to a business outcome. The answer reveals whether they understand eval-driven development, whether they shipped production work (not demos), and whether they treat the buyer as a partner in the spec. The tooling (Promptfoo, LangSmith, Braintrust, Anthropic’s eval tooling, OpenAI’s evals API) is mature in 2026 and any agency that does not use it is a 2023 archetype at 2026 prices.

What is a red flag in AI agency commercial structures?

The single biggest red flag is the agency holding the model API keys and billing a flat fee that includes inference. This is token arbitrage: the buyer pays a hidden markup on usage-based costs that should be transparent, the bill is opaque, and the agency’s margin grows when costs fall. The honest structure is direct billing: the buyer’s Anthropic, OpenAI, or Google account and the buyer’s bill, with the agency’s job being to make that bill predictable and small. Agencies will resist this structure because it removes their margin on a hidden line item.

How long should each segment of the 90-minute evaluation take?

Allocate 0–10 minutes to artifacts walkthrough (PR, evals directory, deploy pipeline), 10–30 minutes to eval discipline (thresholds, regressions, drift handling), 30–55 minutes to architecture deep-dive (system sketch, model abstraction, fallback, context assembly, evolution), 55–75 minutes to post-mortem reading (last 90 days, expensive bug, saying no to clients), and 75–90 minutes to commercial alignment (API key ownership, line-item invoicing, IP transfer, off-ramp). The artifacts block ends roughly 30 percent of evaluations within the first 10 minutes.

What if the agency cannot share PRs because of NDAs?

An NDA is a legitimate constraint, but it should not stop the agency from showing you something concrete. Acceptable substitutes include a redacted PR description, a synthetic PR built from a public repo using the same patterns, or a screen-share of an internal tool the agency built for itself. The red flag is the agency pivoting to a deck or saying their PRs are too sensitive to discuss even at a high level. Real engineering teams talk about code at the structural level without leaking client secrets.

Should I run a paid pilot after the 90-minute evaluation?

Yes. The 90-minute call earns the agency the right to a paid pilot, not the full engagement. A two-week paid pilot with fixed scope, fixed price, real production code, and an eval suite as a required deliverable is the cheapest insurance you can buy. It tells you whether the artifacts you saw on the call were representative or curated. If the agency declines a paid pilot in favor of going straight to the full engagement, that itself is a signal worth taking seriously.

How is this 90-minute evaluation different from a standard procurement vetting checklist?

Standard procurement vetting collects soft signals (case studies, founder LinkedIn profiles, polished decks, reference quotes) over multiple weeks and produces a feeling rather than a decision. The 90-minute field guide replaces soft signals with falsifiable evidence: real PRs, real eval suites, real architecture sketches, real post-mortems, real invoices. Each question has a green, yellow, or red answer, so the output is a count, not a vibe. It is meant to run in parallel with a vetting checklist, not replace one.

Last Updated: May 20, 2026

Arthur Wandzel
