
The AI agency demo problem: why what you're shown isn't what you'll get

Every AI agency demo you have ever seen was rehearsed. Not in the bad-faith sense necessarily, but in the literal one: a specific input was chosen, a specific prompt was tuned, a specific model was selected, and a specific failure mode was patched or routed around the night before the call. That is fine for a sales meeting. It is catastrophic when buyers extrapolate the demo into a production capability claim and sign a contract against it. The gap between the shown system and the shipped system is one of the largest causes of disappointed AI engagements I see, and it is engineered into the sales process rather than discovered after the fact.

This piece names six specific demo tells (adversarial-input requests, trace inspection, cost-per-call, edge-case probing, eval-pass rate, and break-the-demo) that any buyer can deploy in 30 minutes to translate “wow” into a capability score. It sits inside the broader frame of the AI agency manifesto, which argues that 2026 buyers should pay for systems that survive contact with their own data, not slide-deck capability claims.

Decision Scope

This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.

Why AI demos are uniquely deceptive

Software demos have always been a little fake. A SaaS demo runs against a seeded database; a deterministic system either renders or does not, and the deception ceiling is low. AI demos are different along three axes that compound. The input space is open: a single carefully chosen sentence hides ten thousand inputs that would have produced an embarrassing answer. The output is generated, so a demo can look intelligent for cases it has not generalized to. And latency, cost, and reliability are invisible behind a streaming token UI that masks what is happening.

Add the supply pressure of 2026 (agencies founded in 2024 competing on inbound from buyers who do not know what to ask), and demos have evolved into a genre with conventions. Cherry-picked inputs are the baseline. Hand-tuned prompts that took two weeks to converge on are presented as “what the system does.” Manual human fallbacks get framed as agentic behavior. Latency gets hidden by streaming UX so a 12-second response feels like a 2-second one. Citations get edited or fabricated. None of this is necessarily fraud. Most of it is “putting our best foot forward” by people who believe the production system will catch up. But the buyer has to assume it has not, and probe accordingly.

A real AI demo is a sample, not a sales pitch. The six tells below surface the generalization gap fast.

Tell 1: ask for adversarial inputs in real time

The ask: “Can I type a prompt myself, right now, on the live system?” Provide three inputs: one similar to the canned demo, one in the same domain but a different register (terse, jargon-heavy, ungrammatical), and one plausibly user-generated and slightly out of scope.

Why this is the cleanest signal: The demo input was tuned. The agency has run it a hundred times and selected the best response. Your live input has been run zero times. The delta between the canned and the live response measures how much capability lives in the prompt versus the system.

What dishonest demos do here: Refuse: “we cannot show that environment from here.” Stall: “let me check with engineering and follow up.” Steer you back to the script. Or accept the input and produce a visibly worse answer, then explain it away as an “edge case.”

What a clean operator does: Hands you the keyboard, or shares a staging URL during the call. The response will be imperfect, but failure modes will look like reasoning, not collapse. With a kill switch and a cost cap, they are not afraid to give you the keyboard.

Tell 2: ask to see the traces

The ask: “Show me the trace for the call we just made. I want the system prompt, model version, retrieved context, tool calls, and latency breakdown.”

Why this is the cleanest signal: Production AI systems run with full observability. Every call has a trace ID, full prompt, full response, model version, latency, and token counts logged into Langfuse, LangSmith, or a homegrown equivalent. If the agency cannot show you the trace, one of two things is true: the system has no observability and is not in production anywhere, or the trace would expose something they do not want you to see, such as a five-paragraph system prompt with 15 hardcoded few-shot examples for the exact demo case.

What dishonest demos do here: “Our observability is in internal tooling.” “Dashboards are not wired into this environment.” “We can send a redacted trace by email.” Each is a tell that the trace does not exist or is embarrassing.

What a clean operator does: Pulls up the dashboard live, finds the call by trace ID (which appears in the response or URL), and walks you through it. They are proud of the artifact. For the deeper definition of observability in production AI systems, decoding “production-ready” in AI agency proposals decomposes it into eight axes.
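To make the ask concrete, here is a minimal Python sketch of what one trace might hold. The `TraceRecord` and `ToolCall` names and every field name are hypothetical; this is not the schema of Langfuse, LangSmith, or any other tool. The fields simply mirror the items in the ask above.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str          # hypothetical tool name, e.g. "search_docs"
    latency_ms: float


@dataclass
class TraceRecord:
    # Illustrative fields only; real tracing tools use their own schemas.
    trace_id: str
    model_version: str
    system_prompt: str
    retrieved_context: list[str]
    tool_calls: list[ToolCall] = field(default_factory=list)
    model_latency_ms: float = 0.0
    retrieval_latency_ms: float = 0.0
    input_tokens: int = 0
    output_tokens: int = 0

    def latency_breakdown(self) -> dict[str, float]:
        """Split total latency into retrieval, tool, and model time."""
        tools_ms = sum(call.latency_ms for call in self.tool_calls)
        return {
            "retrieval_ms": self.retrieval_latency_ms,
            "tools_ms": tools_ms,
            "model_ms": self.model_latency_ms,
            "total_ms": self.retrieval_latency_ms + tools_ms + self.model_latency_ms,
        }

    def missing_fields(self) -> list[str]:
        """Name the core fields the buyer asked for that came back empty."""
        required = {
            "trace_id": self.trace_id,
            "model_version": self.model_version,
            "system_prompt": self.system_prompt,
        }
        return [name for name, value in required.items() if not value]
```

If the agency's export cannot fill even these fields for the call you just watched, that gap is itself the answer to the question.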

Tell 3: ask for the cost-per-demo-call

The ask: “What did the call we just ran cost in dollars? What does the average call cost? What is the p99 cost, and what is the cost ceiling per request?”

Why this is the cleanest signal: Cost behavior is the second-best proxy for whether an agency has operated a system. Every team that has run an AI feature in production has been through at least one runaway-cost incident: a buggy retry loop, a context that ballooned through tool calls, a malicious user. Those teams know their cost-per-call distribution to two decimal places. Teams that have only built demos do not, because in a demo the cost is invisible.

What dishonest demos do here: “We haven’t profiled costs yet; we’re optimizing capability first.” “Bulk pricing makes per-call cost meaningless.” “Costs depend on the integration, we’ll size it during scoping.” All are not-yet-shipped tells.

What a clean operator does: Quotes a number. “That call was about 3.2 cents; average 2.4, p99 11, cap 25.” They break it down across input tokens, output tokens, retrieval, and tool calls, and name the per-tenant daily cap that prevents a runaway. If the demo runs agentic tool calls, they quote average tool-call count and the longest trajectory seen in production. Specificity is the signal.
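The arithmetic behind an answer like that is short. The sketch below is illustrative only: the token prices are made-up placeholders, not any provider's rates, the function names are hypothetical, and the 25-cent cap is borrowed from the example quote. It prices one call and summarizes a batch with a nearest-rank p99.

```python
import math

# Hypothetical prices in USD per million tokens; substitute real provider rates.
PRICE_PER_M_INPUT = 3.00
PRICE_PER_M_OUTPUT = 15.00
PER_REQUEST_CAP_USD = 0.25  # the 25-cent cap from the example quote


def call_cost(input_tokens: int, output_tokens: int,
              retrieval_usd: float = 0.0, tool_usd: float = 0.0) -> dict[str, float]:
    """Break one call's cost into input, output, retrieval, and tool components."""
    input_usd = input_tokens * PRICE_PER_M_INPUT / 1_000_000
    output_usd = output_tokens * PRICE_PER_M_OUTPUT / 1_000_000
    return {
        "input_usd": input_usd,
        "output_usd": output_usd,
        "retrieval_usd": retrieval_usd,
        "tool_usd": tool_usd,
        "total_usd": input_usd + output_usd + retrieval_usd + tool_usd,
    }


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value covering pct% of the calls."""
    if not values:
        raise ValueError("no cost samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(costs: list[float]) -> dict[str, float]:
    """Average, p99, and the number of calls that exceeded the per-request cap."""
    return {
        "average_usd": sum(costs) / len(costs),
        "p99_usd": percentile(costs, 99),
        "over_cap": sum(1 for c in costs if c > PER_REQUEST_CAP_USD),
    }
```

A team that has operated the system can produce these numbers from its logs in minutes. The summary only reports calls over the cap; enforcing the cap in code is a separate mechanism that this sketch does not cover.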

Tell 4: ask about a specific edge case

The ask: Pick one realistic edge case from your domain (empty input, foreign-language input, a malformed PDF, an ambiguous query, a contradictory instruction) and ask the agency to demo it on the spot.

Why this is the cleanest signal: Edge cases are where prototypes die. A demo is the happy path; production lives in the unhappy paths. Forcing a specific case makes the agency either run it live, describe tested behavior with specificity, or admit they have not handled it. All three are honest. Vague reassurances are not.

What dishonest demos do here: “The system handles that gracefully.” “Our retrieval layer would catch that.” “We’d add a fallback in your engagement.” Filler that buys time; the case was not in the rehearsal.

What a clean operator does: Runs the case live, or names the failure mode without ducking. “Empty input returns a structured error with code EMPTY_QUERY.” “Foreign-language input below 200 characters routes to translation; above, it falls back to a smaller multilingual model; eval pass rate on that path is 78%, which is why we recommend it for the long tail, not the core.” For more on scoping failure modes formally, see “stop scoping AI projects in features, scope them in evaluations.”
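An answer that specific maps almost directly onto code, which is part of why it is hard to fake. Here is a minimal sketch of the routing described in the example quote. The function, route names, and `RouteDecision` type are hypothetical, language detection is assumed to happen upstream and arrive as a flag, and sending an input of exactly 200 characters to the multilingual model is my assumption, since the quote does not say.

```python
from dataclasses import dataclass
from typing import Optional

TRANSLATION_MAX_CHARS = 200  # threshold taken from the example quote


@dataclass
class RouteDecision:
    route: str                        # "error", "translate", "multilingual_model", or "primary_model"
    error_code: Optional[str] = None  # set only when route == "error"


def route_query(text: str, is_foreign_language: bool) -> RouteDecision:
    """Pick a handling path for one input; the language flag is supplied by the caller."""
    if not text.strip():
        # Empty or whitespace-only input gets a structured error, not a model call.
        return RouteDecision(route="error", error_code="EMPTY_QUERY")
    if is_foreign_language:
        if len(text) < TRANSLATION_MAX_CHARS:
            return RouteDecision(route="translate")
        # Assumption: inputs at or above the threshold fall back to the smaller model.
        return RouteDecision(route="multilingual_model")
    return RouteDecision(route="primary_model")
```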

Tell 5: ask for the eval-pass rate on the live system

The ask: “What is the current eval-pass rate of the system you just demoed, on what eval suite, with what threshold? When did it last regress, and what fixed it?”

Why this is the cleanest signal: A real AI system has an eval suite: 50 to 200 ground-truth cases organized by failure mode, with a numeric threshold tied to a business outcome, gating every PR. Eval-pass rate is the single most diagnostic number for an LLM system, the way uptime is for a deterministic one. Shipped agencies talk about evals the way other engineering teams talk about test coverage. Demo-only agencies talk about evals abstractly or not at all.

What dishonest demos do here: “Vibes-based eval; the team reviews outputs weekly.” “LLM-as-a-judge with GPT-4” without a calibration baseline. “We have evals but they are bespoke per client.” The first means no eval. The second means no one has checked whether the judge is grading correctly. The third means evals are talked about in proposals but not embedded.

What a clean operator does: Names suite size, threshold, current pass rate, and a recent regression they fixed. “180 cases across 6 failure modes; threshold 92%, on main 94%, last regression three weeks ago when a provider model update dropped citation accuracy 8 points; caught in CI, rolled the pin back the same day.” A date, a number, a remediation. That story cannot be invented under questioning.
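The mechanics behind that answer are simple enough to sketch. The code below assumes each eval case has already been graded pass or fail by whatever grader the team uses. The `EvalResult` type and function names are hypothetical, and the 92% default comes from the example quote rather than any standard.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EvalResult:
    case_id: str
    failure_mode: str  # the bucket this ground-truth case belongs to
    passed: bool


def pass_rate(results: list[EvalResult]) -> float:
    """Fraction of eval cases that passed."""
    if not results:
        raise ValueError("empty eval suite")
    return sum(r.passed for r in results) / len(results)


def pass_rate_by_mode(results: list[EvalResult]) -> dict[str, float]:
    """Pass rate per failure mode, so a regression can be localized."""
    buckets: dict[str, list[EvalResult]] = defaultdict(list)
    for r in results:
        buckets[r.failure_mode].append(r)
    return {mode: pass_rate(cases) for mode, cases in buckets.items()}


def passes_gate(results: list[EvalResult], threshold: float = 0.92) -> bool:
    """CI-style gate: True only if the overall pass rate meets the threshold."""
    return pass_rate(results) >= threshold
```

The per-mode breakdown is what lets a team say “citation accuracy dropped 8 points” instead of “the number went down.”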

Tell 6: ask to break the demo

The ask: “I’d like to spend the last 10 minutes trying to break the system. May I?”

Why this is the cleanest signal: Every prior tell probes inside the demo’s intended scope. This one steps outside. It tests whether the agency is comfortable with the system being exercised at the edge in front of a buyer. A real operator is comfortable because they have been there in production already and know the failure modes. A capability-theater agency is not, because the system has never been broken in front of a stranger.

What dishonest demos do here: Refuse on time: “hard stop at the top of the hour.” On environment: “this is a sandbox, we can’t risk breaking it.” On policy: “we don’t allow testing without an NDA.” Each is a tell. The agency has done “polite demo + safe questions” hundreds of times and “buyer with a keyboard and 10 minutes” zero times.

What a clean operator does: Hands you the keyboard, watches the system fail, and walks you through what would happen in production. “It just hallucinated a citation; in production that would be flagged by the citation-verifier we ship in front of every response and either rerun with a stricter retrieval pass or surfaced as a structured ‘no citation found’ error.” The demo failure becomes a tour of the production safety net. That inversion is one of the six trust signals in the AI agency trust ladder.

What a clean demo looks like

By the end of a clean demo, you should have:

  • Typed at least three of your own inputs into the live system, and seen the failure modes as well as the wins.
  • Looked at one trace end to end, with timing, model version, and tool calls visible.
  • Heard a specific dollar number for the call you just watched, plus a per-request ceiling enforced in code.
  • Seen one realistic edge case run live, with the failure mode named even if the case did not pass.
  • Heard a specific eval pass rate, threshold, and the date of the last regression with its remediation.
  • Spent the last segment trying to break the system, with the agency narrating the production safety net as the demo failed.

A demo that produces all six artifacts in 45 minutes is fundamentally different from a demo that produces a 15-slide capability deck and a one-minute polished video. The latter is a sales pitch; the former is a sample of an operating system. The difference matters because the contract you are about to sign will be enforced against the operating system, not the sales pitch, and for the next 12 months the gap between them is your problem, not the agency’s.

How to convert this into a buying decision

Run the six tells in order during a single 45-minute call. Score each one zero, half, or a full point: zero for a refusal or hand-wave, half for a partial answer, full for a specific artifact. Six points is a clean operator demo. Three to five points is a partial-credit operator with gaps you should price into the contract as eval and observability deliverables in week 1. Two points or fewer is capability theater; do not sign without rerunning the demo on your own data.
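The rubric is small enough to write down as a scorecard. In this sketch the tell keys and verdict strings are hypothetical labels. Because half points make totals of 2.5 and 5.5 possible and the bands above do not cover them, the code makes a conservative assumption: 5.5 counts as partial credit and 2.5 as capability theater.

```python
TELLS = (
    "adversarial_inputs",
    "traces",
    "cost_per_call",
    "edge_case",
    "eval_pass_rate",
    "break_the_demo",
)
ALLOWED_SCORES = {0.0, 0.5, 1.0}  # refusal or hand-wave, partial answer, specific artifact


def score_demo(scores: dict[str, float]) -> tuple[float, str]:
    """Sum the six tell scores and map the total onto a verdict band."""
    missing = [tell for tell in TELLS if tell not in scores]
    if missing:
        raise ValueError(f"unscored tells: {missing}")
    for tell in TELLS:
        if scores[tell] not in ALLOWED_SCORES:
            raise ValueError(f"{tell}: score must be 0, 0.5, or 1")
    total = float(sum(scores[tell] for tell in TELLS))
    if total >= 6:
        verdict = "clean operator"
    elif total >= 3:
        verdict = "partial credit"      # assumption: 5.5 lands here
    else:
        verdict = "capability theater"  # assumption: 2.5 lands here
    return total, verdict
```

For example, an agency that gives full answers on traces and cost, a partial answer on evals, and refuses the other three tells scores 2.5 and lands in the bottom band.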

One further move. Ask the agency to run a one-week paid pilot against your real data, with the same six tells run by your team at the end of the week. The cost is small relative to the cost of discovering after three months that the system was held together by hand-tuned prompts and a manual fallback. Agencies that reflexively offer the pilot, and come back with traces, eval numbers, and failures discovered on your data, are the ones worth contracting with. The ones that resist are the ones the demo was protecting.

The deeper point is that the AI demo, as a genre, has outlived its usefulness as a sales artifact. It worked when AI capabilities were rare and the question was whether something was technically possible. In 2026 the question has flipped: it is no longer “can you do this?”; it is “have you operated this for someone else, can you show me the receipts, and will you let me touch the keyboard?” Buyers who internalize that flip get systems that ship. Buyers who do not get a slide deck.


Arthur Wandzel is the founder of SFAI Labs, a forward-deployed AI development agency in San Francisco. He has sat through more than 200 AI agency demos as both buyer and vendor, and watched the demo-to-production gap widen across two dozen post-deploy retrospectives.

Frequently Asked Questions

Why is the AI demo problem worse than for traditional software?

Three properties compound. The input space is open, so a single carefully chosen sentence hides ten thousand inputs that would have produced an embarrassing answer. The output is generated, so a system can be made to look intelligent for cases it has not generalized to. And latency, cost, and reliability are invisible behind a streaming token UI that masks what is happening underneath. A SaaS demo runs against a seeded database with a low deception ceiling; an AI demo can hide a five-paragraph hand-tuned system prompt, a manual human fallback, and a model version that no real user will ever see. The buyer has to assume all three until proven otherwise.

What are the most common forms of AI demo theater?

Five recurring patterns. Cherry-picked inputs that have been run a hundred times offline so the agency picks the best output. Hand-tuned prompts that took two weeks to converge and are presented as ‘what the system does.’ Manual human-in-the-loop fallbacks framed as agentic behavior. Latency hidden by streaming UX so a 12-second response feels like a 2-second one. And edited or fabricated citations in the demo deck. None of this is necessarily fraud, but the buyer has to probe for each one because the contract will be enforced against the production system, not the sales artifact.

What are the six demo tells a buyer should run on every AI agency call?

First, request adversarial inputs in real time and type three of your own prompts on the live system. Second, ask to see the trace for the call you just watched, with system prompt, model version, retrieval, tool calls, and latency. Third, ask for the cost of the demo call in dollars, plus average and p99 cost and the per-request ceiling. Fourth, name a realistic edge case from your domain and ask the agency to run it on the spot. Fifth, ask for the live system’s eval-pass rate, threshold, and the date and remediation of the last regression. Sixth, ask to spend the final ten minutes trying to break the system. Six points is a clean operator, three to five is partial credit, and two or fewer is capability theater.

How do I tell if a demo is hand-tuned versus generalizing?

Type your own input, in the same domain but a different register, on the live system during the call. The canned demo input has been run a hundred times and the response is selected from the best of those runs; your input has been run zero times. The delta between the two responses is a direct measure of how much of the apparent capability lives in the prompt versus the system. A clean operator hands you the keyboard or shares a staging URL. A capability-theater agency refuses, stalls, or steers you back to the scripted input.

Why does asking for the trace matter so much during a demo?

A real production AI system logs every call with a trace ID, full prompt, full response, model version, latency, and token counts in a system like Langfuse, LangSmith, or a homegrown equivalent. If the agency cannot show you the trace of the call you just watched, either the system has no observability and is therefore not in production anywhere, or the trace would expose something embarrassing such as a five-paragraph system prompt with hardcoded few-shot examples for the exact demo case. Pulling up the live dashboard during the call is the artifact a real operator is most proud of and will reach for first.

Why is cost-per-call such a strong signal of agency maturity?

Every team that has run an AI feature in production has been through at least one runaway-cost incident, typically a buggy retry loop, a context that ballooned through tool calls, or a malicious user. Those teams know their cost-per-call distribution to two decimal places and can name the per-tenant daily cap that prevents another incident. Teams that have only built demos cannot, because in a demo the cost is invisible. The level of specificity when answering ‘what did this call cost’ is one of the cleanest binary signals between operators and theater.

What is a realistic edge case to ask an AI agency to demo live?

Pick something from your domain that the agency cannot have rehearsed: empty input, a foreign-language input, a malformed PDF, a known-ambiguous query, a contradictory instruction, or a prompt-injection attempt. The honest answers are ‘we run it live and you accept the result,’ ‘we have tested behavior X and the response is Y, here is the trace,’ or ‘we have not handled that case yet.’ All three are acceptable. Vague reassurances such as ‘the system handles that gracefully’ or ‘we’d add a fallback in your engagement’ are filler that buys time and indicate the case was not in the rehearsal.

What does a real eval-pass-rate answer sound like?

Specific suite size, specific threshold, specific current pass rate, specific date and remediation of the last regression. For example: ‘Our eval suite has 180 cases across 6 failure modes, the threshold is 92%, we are at 94% on main, and the last regression was three weeks ago when a provider model update dropped citation accuracy by 8 points; we caught it in CI and rolled the model pin back the same day.’ The story has a date, a number, and a fix. Vibes-based eval, uncalibrated LLM-as-a-judge, or ‘evals are bespoke per client’ all indicate evals are talked about in proposals but not embedded in the system.

Why ask the agency to let you break the demo?

It explicitly steps outside the demo’s intended scope and tests whether the agency is comfortable with their system being exercised at the edge in front of a buyer. A real operator is comfortable because they have already been there in production and know the failure modes. A capability-theater agency is not, because the system has never been broken in front of a stranger before. A clean operator hands you the keyboard, watches the system fail, and walks you through what would happen in production, typically a citation-verifier, a structured error, or a fallback model. The demo failure becomes a tour of the production safety net, which is the inversion that distinguishes operators from resellers.

Should I run a paid pilot before signing an AI agency contract?

Yes, if the engagement is more than six figures or runs longer than a quarter. Run a one-week paid pilot against your real data, with the six demo tells run by your team at the end of the week against the pilot output. The pilot cost is small relative to the cost of discovering after three months that the demoed system was held together by hand-tuned prompts and a manual fallback. Agencies that reflexively offer the pilot and come back with traces, eval numbers, and a list of failures discovered on your data are the ones worth contracting with. Agencies that resist the pilot are the ones the demo was protecting.

Last Updated: May 28, 2026
