Stack three AI agency proposals on a desk and read them back to back. The fonts differ. The cover images differ. The team-photo arrangement differs. The substance does not. Each one opens with an executive summary that could have been written by either of the other two, lists a methodology section that is a renamed version of the same five phases, presents team bios padded to the same paragraph length, drops case studies that describe similar deployments at similar logos with similar redacted numbers, prices the work at a rate-card range that lands inside a 10 percent band of every competitor, and closes with a Gantt chart that has the same six bars in the same order. The cover slide is the only part of the document that committed to a real opinion.
This is not an accident of effort. It is a structural outcome. Proposals are written by proposal teams under a common procurement-driven format, against the same RFP questions, citing the same trade-press benchmarks, defending the same rate cards. The format itself flattens substance. By the time the document is polished enough to send, the only thing that survived editing is the boilerplate. The reader is left to pick a vendor on cover design, brand association, and account-manager rapport, none of which predicts whether the team will ship a working AI system.
If you are evaluating three proposals next week, this piece is for you. The argument: identify the seven sections of the typical AI agency proposal that differ in font but not in substance, see what a substantive version of each section would contain, and finish with the five things that genuinely distinguish a proposal from a real engineering organization. The brief upstream argument is that the AI agency RFP itself is broken and a paid pilot replaces it; this piece is the diagnostic that follows from the same logic. Even if you are stuck running a proposal-driven process for procurement reasons, you can at least learn to read the documents for the few signals that survive the format.
The seven sections that look different and read the same
Nearly every AI agency proposal in 2026 is built from the same seven sections. The order is consistent, the headings are interchangeable, and the substance is filler. Walk through each one, then write down what a substantive version would contain. The gap between the two is the proposal sameness problem.
1. The executive summary
The filler version. A paragraph saying the agency understands the buyer’s industry, has deep experience in AI/ML, will deliver high-quality results, and looks forward to partnering. A second paragraph naming three named-but-vague capabilities (“LLM application development”, “agentic workflows”, “responsible AI”) that every other vendor also names. A third paragraph asserting alignment with the buyer’s strategic priorities, lifted verbatim from the RFP introduction. The summary tells you the agency read the RFP. It tells you nothing about whether they can ship.
The substantive version. A single paragraph that names the specific decision the buyer is trying to make, the specific failure mode the agency thinks the project will hit, and the specific architecture decision the agency has already made about how to avoid it. Example: “Your routing problem is bottlenecked on retrieval recall, not generation quality. The team will build a hybrid BM25 + dense retriever evaluated against your 500-document gold set before week 3, and will not move to fine-tuning until that retriever clears 92 percent recall@10. We expect 6 of your 14 categories to be label-noise problems and have allocated 4 engineering days to relabeling rather than modelling.” That executive summary has taken a position. It is falsifiable on day 21. The other one is unfalsifiable forever.
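What makes that summary falsifiable is that recall@10 is a number anyone can compute. Below is a minimal sketch of the check, assuming the gold set maps each query to the set of document ids judged relevant and the retriever returns a ranked list of ids per query. The toy data and the names `gold`, `runs`, and `RECALL_GATE` are illustrative, not part of any real engagement.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("each gold query needs at least one relevant document")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(
    runs: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int = 10
) -> float:
    """Average recall@k over every query in the gold set."""
    scores = [recall_at_k(runs.get(q, []), rel, k) for q, rel in gold.items()]
    return sum(scores) / len(scores)


# Hypothetical toy data: two gold queries and the retriever's ranked output.
gold = {"q1": {"d1", "d2"}, "q2": {"d7"}}
runs = {"q1": ["d1", "d9", "d2"], "q2": ["d3", "d4"]}

RECALL_GATE = 0.92  # the threshold quoted in the example summary
score = mean_recall_at_k(runs, gold, k=10)
gate_passed = score >= RECALL_GATE
print(f"recall@10 = {score:.2f}, gate passed: {gate_passed}")
```

On the toy data the retriever finds both documents for the first query and none for the second, so the mean is 0.50 and the gate fails. A real gold set would replace the two dictionaries; the arithmetic stays the same.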
2. The methodology section
The filler version. Five phases (Discovery, Design, Development, Deployment, Optimization) rendered as a horizontal arrow diagram with sub-bullets that could describe any consulting engagement in any decade. “Stakeholder interviews”, “requirements gathering”, “iterative development”, “user acceptance testing”, “post-launch monitoring”. Sometimes the phases are six instead of five and have different colors. The diagram is identical across vendors because it descends from the same generic services-consulting playbook, not from how the vendor builds AI systems.
The substantive version. A description of how this team builds AI systems, with named tools, named cadence, and named artifacts. “Eval suite committed in week 1 to evals/ in your repo, run on every PR. Daily inference-cost dashboard against a $0.04 per-call ceiling. Weekly model-card update with eval drift, latency p95, and refusal rates. Architecture decision records committed under docs/adr/ with the rejected alternatives spelled out. Demo every Thursday with a working PR, not a deck.” A methodology section that names CI tools, eval frameworks, observability stacks, and review cadences is a methodology section that descends from a real engineering practice. A five-phase arrow diagram does not.
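The cost ceiling and the latency p95 in that quote are both a few lines of arithmetic over a call log. Here is a hedged sketch of what the daily check could look like, assuming the log yields a per-call cost in dollars and a per-call latency in seconds. The function names and the sample log are hypothetical, and the percentile uses the simple nearest-rank method rather than any particular observability vendor's definition.

```python
from typing import Dict, List, Sequence

COST_CEILING_USD = 0.04  # per-call ceiling quoted in the methodology example


def p95(values: Sequence[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def daily_report(costs_usd: List[float], latencies_s: List[float]) -> Dict[str, object]:
    """Summarise one day of calls and flag a breach of the cost ceiling."""
    mean_cost = sum(costs_usd) / len(costs_usd)
    return {
        "mean_cost_usd": round(mean_cost, 4),
        "latency_p95_s": p95(latencies_s),
        "cost_ok": mean_cost <= COST_CEILING_USD,
    }


# Hypothetical call log for one day.
report = daily_report(
    costs_usd=[0.03, 0.035, 0.05, 0.02],
    latencies_s=[0.8, 1.2, 0.9, 3.5, 1.1],
)
print(report)
```

The sketch applies the ceiling to the mean cost per call. A stricter reading would fail on any single call above the ceiling, and a substantive proposal should say which reading it commits to.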
3. The team bios
The filler version. Three to five paragraphs of about 90 words each. Named senior leader, named technical lead, named project manager, named “AI specialist” of unclear seniority. Each paragraph lists a degree, a previous employer, “X years of experience in AI/ML”, and three to five bullet-list capabilities. None of the named people will do the work. The senior leader has not committed code in three years. The “AI specialist” is a 22-month-out-of-bootcamp engineer who will be substituted on the first staffing call. Headshots are in identical circular crops. LinkedIn URLs are absent.
The substantive version. A list of every individual contributor who will spend more than 10 percent of their time on the engagement, by name, with a GitHub handle, a link to a public repo or blog they have authored, and a stated time-commitment percentage. Senior engineer, 60 percent; second senior engineer, 40 percent; ML engineer, 80 percent; staff platform engineer, 20 percent. The named senior engineer’s GitHub link should show recent commits in production code, not stale forks from 2023. If the proposal cannot name commit-level contributors who will work the project, the proposal is selling team capacity it does not have. The 2025 Stack Overflow Developer Survey reports 84 percent of respondents using or planning to use AI tools in their development process; an agency that cannot name the engineers building with those tools is not keeping current.
4. The prior-work section
The filler version. Six to ten logo tiles arranged in a grid. Fortune-500 brands, mid-market brands, a couple of unicorns. Below the logos, two-sentence project descriptions: “Built a custom LLM application for $LOGO’s customer support team, delivering high-quality results.” No links. No metrics. No architecture detail. No PRs. No public artifacts. The reader has no way to tell whether the agency built a production system at $LOGO or ran a four-week prototype in a single product manager’s sandbox account.
The substantive version. Three to five engagements, each with: the named business problem (one sentence), the production system that shipped (one sentence with the eval threshold the system clears), and a link to a public artifact: a blog post the engagement produced, a conference talk, an open-source library, or a vendor case study reviewed by the client legal team. If a logo cannot have any of those, omit the logo. A row of unverifiable logos is a signal that the agency has not asked any of those clients for permission to publish anything substantive, usually because the substantive answer would be unflattering.
5. The pricing table
The filler version. A table with rate-card columns. Senior engineer $250–$300 per hour, mid-level engineer $180–$220, project manager $180–$200, “AI specialist” $300–$350. Total estimated range, $X to $Y, where Y is roughly 1.4 times X. Optional retainer pricing at a 10 percent discount. The table tells you what the agency will charge per hour. It tells you nothing about how many hours the project will need or how much of those hours go to coordination overhead instead of shipped code.
The substantive version. A fixed-fee number for a defined scope, with a line item showing what percentage of the budget goes to engineering, what percentage to infrastructure and inference (with vendor pass-through math shown), and what percentage to coordination and project management. A serious agency will commit to a fixed fee for a paid pilot, will name the inference cost ceiling per call, and will show the math on how the engineering ratio is preserved against scope changes. Hourly rate cards are how agencies that cannot estimate price their inability to estimate.
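The "math shown" part of that pricing table is small enough to sketch. The example below assumes a fixed fee split into the three budget lines named above and a worst-case inference pass-through computed from an expected call volume. Every number and name in it (`FixedFeeQuote`, the $120,000 fee, the 300,000 calls) is hypothetical and chosen only to make the arithmetic visible.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FixedFeeQuote:
    """A fixed-fee quote split into the three budget lines named above."""

    total_usd: int
    engineering_usd: int
    infra_inference_usd: int
    coordination_usd: int

    def shares(self) -> Dict[str, float]:
        """Return each line item as a percentage of the fixed fee."""
        parts = {
            "engineering": self.engineering_usd,
            "infra_inference": self.infra_inference_usd,
            "coordination": self.coordination_usd,
        }
        if sum(parts.values()) != self.total_usd:
            raise ValueError("line items must sum to the fixed fee")
        return {name: round(100 * usd / self.total_usd, 1) for name, usd in parts.items()}


def inference_pass_through_usd(expected_calls: int, ceiling_cents_per_call: int) -> int:
    """Worst-case inference spend if every call costs exactly the ceiling."""
    return expected_calls * ceiling_cents_per_call // 100


# Hypothetical numbers for illustration only.
quote = FixedFeeQuote(
    total_usd=120_000,
    engineering_usd=84_000,
    infra_inference_usd=18_000,
    coordination_usd=18_000,
)
print(quote.shares())
print(inference_pass_through_usd(expected_calls=300_000, ceiling_cents_per_call=4))
```

With these made-up figures engineering takes 70 percent of the fee, and a 4-cent ceiling over 300,000 calls caps inference at $12,000 of the $18,000 infrastructure line. The point of showing the split is that a scope change has to move one of these lines visibly.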
6. The case studies
The filler version. Two to three two-page write-ups in marketing prose. Buyer profile, business problem, “our solution”, “the results”. Results are reported as percentages without baselines: “30 percent improvement in efficiency”, “40 percent reduction in cost”, “2x faster”. No methodology. No eval. No counterfactual. No code. No links. The reader cannot tell whether the 30 percent improvement was measured on a controlled comparison or estimated by the buyer’s PM in a meeting.
The substantive version. One case study at most, with: the eval that defined success, the threshold the system needed to clear, the threshold it cleared, the delta-against-baseline measurement (with the baseline measurement methodology shown), the per-call inference cost, the latency p95 in production, and the engagement length in calendar weeks. Bonus: a link to a published post-mortem of the engagement, including what did not work, written by the named senior engineer who shipped the system. McKinsey’s “State of AI” 2024 reports that the largest gap between AI adoption and value capture is failure to measure outcomes against a baseline; an agency case study that does not show the baseline is one that did not measure value either.
7. The timeline and Gantt chart
The filler version. A six-month engagement plotted across six bars. Discovery in month 1. Design in month 2. Development in months 3–4. Testing in month 5. Deployment and handover in month 6. The bars are uniformly colored. No risk markers. No decision gates. No kill-clause triggers. The timeline assumes everything goes right and is therefore unfalsifiable until the engagement is over.
The substantive version. A timeline tied to artifacts and decision gates. Week 1: eval suite committed and labeled test set finalized. Week 3: retrieval baseline measured against eval. Week 5: first model decision gate; if recall@10 is below 85 percent, kill the engagement and rebrief. Week 8: production canary with 5 percent of traffic, eval running on every prediction. Week 12: full cutover or no-go decision. Each gate has a written threshold and a written “kill or continue” rule. A timeline without decision gates is a wish list.
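A written threshold plus a written "kill or continue" rule is simple enough to encode directly. The sketch below models only the week-5 gate, since it is the one gate in the example timeline with a stated number. The `DecisionGate` class and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionGate:
    week: int
    metric: str
    threshold: float  # minimum acceptable value for the metric
    on_fail: str      # the written rule that applies below the threshold

    def decide(self, measured: float) -> str:
        """Return 'continue' at or above the threshold, otherwise the fail rule."""
        return "continue" if measured >= self.threshold else self.on_fail


# The week-5 gate from the example timeline; other gates need their own written thresholds.
week5 = DecisionGate(
    week=5,
    metric="recall@10",
    threshold=0.85,
    on_fail="kill the engagement and rebrief",
)

print(week5.decide(0.81))
print(week5.decide(0.90))
```

If a gate cannot be written in this form, with one metric, one number, and one consequence, it is probably a milestone description rather than a decision gate.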
Why proposal sameness is structural, not lazy
The instinct is to attribute proposal sameness to lazy proposal teams. This is wrong. The pattern is structural; it is the predictable equilibrium of how AI agency proposals get produced today.
First, RFPs ask the same questions. When five vendors receive the same 60-page RFP with the same 47 numbered questions, the responses converge. Each vendor answers the same questions in the same order, citing the same industry benchmarks, defending the same rate cards. The RFP is the upstream cause of the downstream sameness. Procurement teams optimize for comparability and end up commoditizing the work they are trying to evaluate. This is the failure mode that motivates the paid 2-week pilot model; switch the artifact that vendors compete on, and the sameness collapses.
Second, proposal teams are not engineering teams. The proposal is written by a sales-engineering or proposal-management function whose job is to produce a polished deliverable that meets a deadline. They do not have access to the engineering team’s actual decisions, repos, or eval frameworks. They have access to the marketing-approved capabilities deck, the case-study library, and the rate card. So that is what goes in the proposal. The artifacts that would predict success (pull requests, eval suites, post-mortems, architecture decision records) do not fit the proposal team’s workflow and therefore do not appear in the document.
Third, the procurement-side rubric rewards safety. Procurement scoring weights price, headcount, ISO certifications, references, and turnover risk. It does not weight engineering velocity, eval rigor, or architecture taste, because those are illegible to a category manager scoring 12 vendors against the same template. So the optimal vendor strategy is to look exactly like the average vendor, but slightly better-presented. Variance is punished. Sameness is rewarded. Five vendors playing this game converge on the same proposal.
Fourth, the legal-review pass strips the substance. Anything that names a real engineering decision can be cited back as a contractual commitment, so legal removes it. Anything that quotes a real client outcome creates indemnification exposure, so legal hedges it. Anything that names a specific senior engineer creates a staffing claim, so legal turns it into “team profiles”. By the time the proposal is legally clean, every falsifiable assertion has been polished into an unfalsifiable abstraction. The cover image is unchanged because the cover image is not a contractual commitment.
The combined effect is that the proposal format selects for sameness the way a uniform selects for indistinguishability. A buyer reading three proposals of any depth is reading three documents engineered to be undifferentiable on the substance and differentiable only on the cover. Picking on the cover is then the only path the document leaves open. That is why proposal-driven procurement so reliably picks the wrong vendor for AI work.
The five things that distinguish a real proposal
If you cannot retire the proposal entirely, you can at least demand the five elements that make sameness impossible to maintain. Any proposal missing all five is a proposal worth disqualifying on first read.
1. A live, runnable eval suite. Not a description of evals. A runnable suite: a git clone-able repo that builds, runs against a public test set or one the agency will provide, and emits a number. The agency can name a previous engagement’s eval as the example, with the customer-specific data redacted. The test for whether the agency has eval discipline is whether they can hand you working code that runs, not whether they can describe the discipline in three paragraphs. Promptfoo, LangSmith, Braintrust, the OpenAI eval library, and the Anthropic eval tooling are all mature enough in 2026 that any serious agency has at least one production eval pipeline they can demo in 20 minutes. If the proposal cannot link to one, the agency does not have one.
2. A named senior pull-request contributor. Not a “team profile”. A named individual, with a GitHub handle, who has agreed in writing to commit code on this engagement, with a stated minimum percentage of their time. The handle should resolve to a GitHub profile that shows recent commits to production-shaped repositories, ideally including AI/ML code in the same modality as the engagement. If the named individual’s last commit to a non-tutorial repo was more than six months ago, they are not a senior PR contributor; they are a sales asset. The named-engineer commitment is the single highest-signal element a proposal can carry, and it is also the element most agencies will refuse to put on paper because it locks them into staffing they cannot easily reshuffle.
3. Specific architecture decisions made already. A proposal that has read the problem brief and committed to an architectural position is a proposal from a team that has thought about your problem. “We will use a hybrid retriever, not a fine-tune. We will host on AWS Bedrock with a Claude or GPT-4-class fallback, not Azure-only. We will not use an agent framework for this; the action space is too narrow to justify the indirection. We will commit to under $0.04 per call at p95 below 4 seconds and will fail fast if either ceiling is breached in week 4.” Most proposals refuse to commit to architecture in writing because committing creates argument surface area for procurement and legal. The refusal is informative; it tells you the agency intends to make every architecture decision in billable hours after the contract is signed.
4. A reading of one of their own post-mortems. A real engineering organization writes about engagements that did not go to plan. The post-mortem can be public (a blog post, a conference talk, an open-source retrospective) or it can be internal but readable on request under NDA. Either way, the agency that can show you a post-mortem they wrote on a hard engagement is an agency where engineering reflection is a real practice. The agency that has zero post-mortems either has zero hard engagements (improbable) or has chosen not to write them (more probable, more disqualifying). Ask for one in the first call. Watch what happens.
5. A named kill clause. A single line in the proposal: “After the [pilot, milestone-1, week-N] checkpoint, either party may terminate with no further obligation. The buyer retains all artifacts under standard work-for-hire terms.” The kill clause is the proposal element that best separates partner-style agencies from extractive ones. An agency that resists a kill clause, or that buries the termination terms in a 9-page MSA addendum, is signalling that the contract is the moat. An agency that names the kill clause in the proposal itself is signalling that the work is the moat. The first agency loses interest after signing. The second one is still around in month 12.
These five elements are mechanically incompatible with proposal sameness. Two competing agencies cannot both submit a runnable eval suite written for the same problem and have those suites be undifferentiable. Two agencies cannot both name the same senior engineer with the same GitHub handle. Two agencies cannot both commit to the same architecture decisions in writing without one of them obviously copying the other. The five elements force variance into a process whose default is sameness.
How to read three identical-looking proposals next week
If you are looking at three proposals on Monday, do not read them in order. Read them in this order, looking for these five elements only.
Open each proposal. Search for “eval”, “evaluation framework”, “test set”, and “threshold”. If two of the three proposals do not return any substantive hit, you have your first ranking. The proposal that names a specific eval, even imperfectly, is ahead of the two that did not.
Search for GitHub handles, git repositories, and named senior engineers with time commitments. If any proposal names individual contributors with handles and time-commitment percentages, mark it. If two refuse, you have your second signal.
Search for “we will” architecture commitments. Phrases like “we will use”, “we will not use”, “we have decided to”. The proposals that have made decisions in writing are doing engineering. The ones that defer all decisions to “the discovery phase” are doing sales.
Ask, in writing, for one post-mortem from each agency. Give them 5 business days. The agency that returns nothing is the agency that does not write them. The one that returns a 2-page reflection on a hard engagement is the one to keep on the list.
Read the contract terms section for a kill clause. If the proposal commits to a clean termination point in plain language, mark it. If the termination terms are buried in a “standard MSA” attachment that you will only see after legal review, treat that as a no-clause until proven otherwise.
After 30 minutes you will have a five-element scorecard for each proposal that is more diagnostic than four hours of reading the documents cover to cover would have produced. The cover designs will still differ. The substance will not. The score will tell you which agency was trying to win the work and which one was running the proposal-shop default. For the deeper procurement-side comparison framework that uses these signals as a rubric, the comparing AI development proposals guide turns this scorecard into a 12-point template you can hand to a procurement partner.
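The first pass of that scorecard can be mechanized if the proposals are available as text. The sketch below assumes plain-text exports and a handful of search patterns based on the phrases listed above. The patterns, the `scorecard` function, and both sample snippets are hypothetical. A keyword hit only marks a passage to read, it does not prove substance, and the post-mortem element is recorded by hand because it arrives by email rather than in the document.

```python
import re
from typing import Dict, List

# Hypothetical search patterns; tune them to the proposals in front of you.
TEXT_SIGNALS: Dict[str, List[str]] = {
    "eval": [r"\bevals?\b", r"evaluation framework", r"test set", r"threshold"],
    "named_engineer": [r"github\.com/[\w-]+"],
    "architecture": [r"\bwe will (?:not )?use\b", r"\bwe have decided to\b"],
    "kill_clause": [r"either party may terminate", r"kill clause"],
}


def scorecard(proposal_text: str, post_mortem_received: bool) -> Dict[str, bool]:
    """Mark which of the five elements show at least one candidate hit."""
    hits = {
        element: any(re.search(p, proposal_text, re.IGNORECASE) for p in patterns)
        for element, patterns in TEXT_SIGNALS.items()
    }
    hits["post_mortem"] = post_mortem_received  # tracked manually, not searched
    return hits


# Two made-up snippets standing in for full proposals.
substantive = (
    "We will use a hybrid retriever, not a fine-tune. The eval suite lives at "
    "github.com/example-agency/evals with a written threshold. Either party "
    "may terminate after the week-2 checkpoint."
)
filler = "We deliver high-quality results through a proven five-phase methodology."

for name, text, pm in [("substantive", substantive, True), ("filler", filler, False)]:
    card = scorecard(text, post_mortem_received=pm)
    print(name, sum(card.values()), card)
```

The made-up substantive snippet scores five of five and the filler snippet scores zero. Real proposals will land in between, and any element marked true still needs a human to confirm the hit is a commitment and not a passing mention.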
The argument of this piece is not that proposals are dishonest. It is that they are structurally identical because the format produces identity. The fix is to stop reading them as documents and start reading them for the five elements that the format cannot suppress when an agency has chosen to put them in. Most have not. The two or three out of ten that do are the ones to call back. Everything else is filler around the cover.
Arthur Wandzel is the founder of SFAI Labs, a forward-deployed AI development agency in San Francisco. He has read several hundred AI agency proposals over the last three years and has watched the sameness pattern hold across every category of buyer (Series B startup, mid-market enterprise, regulated incumbent), which is what convinced him the cause is structural rather than effort-related.
Frequently Asked Questions
Who should use this framework?
Use this framework when the decision has material budget, implementation, or operating-model consequences. It is most useful for founders, CTOs, product leaders, and executives comparing AI build, buy, and agency options.
What evidence should I ask for before acting on this advice?
Ask for concrete artifacts: shipped work, evaluation results, operating metrics, security posture, and references from similar engagements. The point is to verify behavior and outcomes, not just accept a polished proposal.
When does this guidance not apply?
It may not apply cleanly in highly regulated environments, unusually data-sensitive systems, or companies with unusually strong internal AI teams. Treat the article as a decision lens, then adjust for your constraints.
What is the first practical next step?
Turn the recommendation into a short review checklist, assign an owner, and test it against one real vendor, project, or internal roadmap decision before scaling it across the organization.
How should teams avoid overfitting to this framework?
Use the framework to expose tradeoffs, not to force a predetermined answer. If the facts contradict the rule of thumb, document the exception and make the decision explicit.