Enterprise Software 16 min read

Inside an AI agency post-mortem: what we learned shipping 12 production agents in 2025

The lessons from a year of shipping production agents are not the ones the trade press covered. Vendor announcements, model leaderboards, and benchmark wars dominated 2025 AI media. The actual operating lessons, the ones that determine whether an agent survives its first month in production, were quieter, more structural, and rarely written down in public. This is a structured retrospective on the SFAI-style “12-agent year”: the pattern of work that a forward-deployed AI agency runs across roughly a dozen production agents shipped in a calendar year. The framing is illustrative; the lessons are what we saw.

Seven lessons emerged with enough force across the year that we have encoded each into our standard contract template, our default architecture, and our internal training. None of the seven is a prompt-engineering tip. All seven are structural choices that determined whether the agent was still alive in month three. If your agency has shipped production agents in the last 18 months and these lessons do not match yours, the divergence is itself worth writing down. If you are about to ship your first, this list is the checklist we wish someone had handed us in early 2025.

The frame for this retrospective is the AI agency manifesto: the stance that an AI dev partner in 2026 is forward-deployed, eval-disciplined, and accountable for production behavior, not a slide-driven consulting firm with a chatbot demo. Each of the seven lessons below is consistent with that stance and was earned across 2025 in ways that, at the time, felt like surprises rather than principles.

Decision Scope

This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.

Lesson 1: eval-pre-architecture wins

Failure shape. Multiple engagements in early 2025 followed the historical pattern: design the architecture, ship the prototype, then “add evals” in week 4 or 5. Without exception, the eval-late projects produced systems whose architecture was wrong for the eval that emerged once the team finally wrote one down. The failure was not “bad architecture”; the failure was “architecture optimized for the wrong objective.” A retrieval system tuned for recall when the eval cared about precision. A multi-agent system tuned for parallelism when the eval cared about cost-per-turn. The team rewrote significant portions in week 6.

Lesson. Write the eval suite first (at minimum a written rubric, ideally 20–50 ground-truth cases) before any architectural commitment. The eval is the spec; the architecture is the implementation of the spec. Reversing the order produces an implementation of an unspecified spec, which is the technical definition of guesswork.

Contract clause to encode. “The first deliverable, due no later than day 5 of the engagement, is an eval rubric and a baseline suite of at least 20 ground-truth cases. Architectural decisions made before this deliverable are advisory; architectural decisions after are binding.”
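
As a minimal sketch of what “the eval is the spec” can look like before any architecture exists, the snippet below defines a few ground-truth cases and a pass-rate function. The names `EvalCase`, `run_eval`, and `placeholder_agent`, the exact-match scoring rule, and the sample cases are all illustrative assumptions, not a description of any particular eval framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One ground-truth case: an input and the answer the rubric accepts."""
    prompt: str
    expected: str


def run_eval(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases where the agent's answer matches exactly."""
    if not cases:
        raise ValueError("an eval suite needs at least one case")
    passed = sum(1 for c in cases if agent(c.prompt).strip() == c.expected)
    return passed / len(cases)


# Hypothetical cases; the clause above asks for at least 20.
CASES = [
    EvalCase("refund policy window?", "30 days"),
    EvalCase("support email?", "help@example.com"),
]


def placeholder_agent(prompt: str) -> str:
    """Stand-in for the architecture that has not been chosen yet."""
    return "30 days" if "refund" in prompt else "unknown"


score = run_eval(placeholder_agent, CASES)
print(f"pass rate: {score:.2f}")
```

Because the suite only depends on a callable from prompt to answer, any later architecture can be scored against the same cases without changing the spec.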

Lesson 2: cost-cap is product feature, not infrastructure detail

Failure shape. A handful of agents shipped in 2025 hit cost-per-call numbers 3–8x higher than the implicit business model could absorb, and the failure was discovered only when the customer’s CFO ran the math. The team had been treating cost as an infrastructure concern (“we can optimize later”) when in fact cost-per-call was a product constraint binding tighter than latency or accuracy. The agent worked. It just could not survive the unit economics.

Lesson. The cost ceiling is a first-class product requirement, named in the engagement charter alongside the accuracy threshold. Every architecture decision is evaluated against the cost ceiling on the same footing as accuracy. A model upgrade that improves accuracy by 4% and increases cost-per-call by 60% is a regression, not an upgrade, until the unit economics are renegotiated.

Contract clause to encode. “The engagement charter names a cost-per-call ceiling that is binding on architectural decisions. Any change that breaches the ceiling requires a written addendum signed by the client product owner.”
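
One way to make the ceiling binding is to check it in the same gate that checks accuracy. The sketch below is a hedged illustration: `Candidate`, `evaluate_change`, and every number in it (accuracy floor, ceiling, per-call costs) are hypothetical values chosen to mirror the “+4% accuracy, +60% cost” example, not figures from any engagement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float       # fraction of eval cases passed
    cost_per_call: float  # in the charter's currency unit


def evaluate_change(candidate: Candidate, accuracy_floor: float,
                    cost_ceiling: float) -> str:
    """Treat the cost ceiling and the accuracy threshold as equal gates."""
    if candidate.cost_per_call > cost_ceiling:
        return "reject: breaches cost ceiling"
    if candidate.accuracy < accuracy_floor:
        return "reject: below accuracy threshold"
    return "accept"


# Hypothetical charter values.
ACCURACY_FLOOR = 0.85
COST_CEILING = 0.06

current = Candidate("current-model", accuracy=0.86, cost_per_call=0.05)
upgrade = Candidate("bigger-model", accuracy=0.90, cost_per_call=0.08)  # +60% cost

print(evaluate_change(current, ACCURACY_FLOOR, COST_CEILING))
print(evaluate_change(upgrade, ACCURACY_FLOOR, COST_CEILING))
```

In this sketch the more accurate model is rejected until someone changes `COST_CEILING`, which corresponds to the written addendum in the clause above.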

Lesson 3: trace-first debugging is non-negotiable

Failure shape. Several engagements ran into incidents in the first month of production where the agent produced a wrong output and no one, neither the agency nor the client, could explain why. The team had logged inputs and outputs but not the intermediate state: which tools were called, what each tool returned, what context was retrieved, and which model decision led to which next step. Debugging then required re-running the agent with debug flags, hoping the failure was reproducible, and explaining to the client why “we’re investigating” had to last 48 hours.

Lesson. Every production agent ships with full trace instrumentation from day 1: tool calls, retrieved context, model decisions, token counts, latencies, and costs, all attached to a request ID and queryable in under five minutes. Trace-first debugging is not a “good engineering practice”; it is the only way to honor a same-day SLA on incident triage. Adding it after the first incident is too late, because the team is then debugging the previous month’s traces with this month’s tooling.

Contract clause to encode. “Every agent shipped under this engagement includes structured trace instrumentation per request, queryable by ID, retained for at least 90 days, and accessible to the client engineering team. Trace coverage is verified by an automated check before the eval gate runs.”
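
A minimal sketch of per-request tracing, assuming an in-memory dictionary as a stand-in for a real trace store. `Trace`, `TRACE_STORE`, `handle_request`, and the recorded tool name, token count, and cost are hypothetical placeholders; retention and access control are out of scope here.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Trace:
    """All intermediate state for one request, keyed by request ID."""
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    events: list = field(default_factory=list)

    def record(self, kind: str, **data) -> None:
        """Append one structured event with a wall-clock timestamp."""
        self.events.append({"kind": kind, "ts": time.time(), **data})


# Stand-in for a queryable store; a real one would persist and retain traces.
TRACE_STORE: dict[str, Trace] = {}


def handle_request(question: str) -> tuple[str, str]:
    """Toy agent turn that records each step instead of only input and output."""
    trace = Trace()
    TRACE_STORE[trace.request_id] = trace
    trace.record("tool_call", tool="search", args={"q": question})
    trace.record("tool_result", tool="search", result=["doc-17"])
    trace.record("model_decision", next_step="answer", tokens=212, cost=0.004)
    return trace.request_id, "answer based on doc-17"


request_id, answer = handle_request("what is the refund window?")
# Incident triage starts from the request ID, not from a re-run.
print(json.dumps(asdict(TRACE_STORE[request_id]), indent=2))
```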

Lesson 4: human-in-loop is product, not patch

Failure shape. Multiple 2025 engagements proposed agents that would operate fully autonomously, with the human-in-loop “added later if needed.” In every case, the human-in-loop ended up being needed in the first 30 days, and bolting it on after the agent had been designed for autonomy required a partial rewrite. The escalation paths were wrong, the audit trails were wrong, the UI was wrong, and the agent’s own confidence signals were not designed to surface uncertainty back to a human.

Lesson. Human-in-loop is a product surface, not a fallback. If a human is going to review even 5% of agent outputs, the agent must be designed from day 1 to expose its uncertainty, surface its trace, and accept human override as a first-class operation. “We’ll add review later” is the same anti-pattern as “we’ll add tests later.”

Contract clause to encode. “If the agent’s deployment surface includes human review, oversight, or override at any rate above 0% of requests, the human-in-loop product surface is a day-one deliverable, not a phase-2 addition.”
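
To make “expose uncertainty and accept override as a first-class operation” concrete, here is a small sketch. It assumes the agent can emit a confidence score at all, which is itself a design decision; `AgentOutput`, `route`, `apply_override`, and the 0.8 threshold are hypothetical names and values.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class AgentOutput:
    answer: str
    confidence: float                   # agent's own uncertainty signal, 0.0 to 1.0
    request_id: str                     # lets the reviewer pull the trace
    overridden_by: Optional[str] = None # audit trail for human overrides


REVIEW_THRESHOLD = 0.8  # hypothetical; would be tuned against the eval suite


def route(output: AgentOutput) -> str:
    """Send low-confidence outputs to a human instead of straight to the user."""
    return "auto_send" if output.confidence >= REVIEW_THRESHOLD else "human_review"


def apply_override(output: AgentOutput, reviewer: str, corrected: str) -> AgentOutput:
    """Record a human correction while keeping the link to the original request."""
    return replace(output, answer=corrected, overridden_by=reviewer)


draft = AgentOutput("Refund approved", confidence=0.55, request_id="req-001")
print(route(draft))

final = apply_override(draft, reviewer="reviewer-7", corrected="Refund denied")
print(final)
```

The point of the sketch is that the output type carries the confidence, the request ID, and the override field from the start, so review is not bolted onto a type that was designed for autonomy.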

Lesson 5: model-upgrade testing is monthly cadence, not quarterly

Failure shape. Several 2025 engagements scheduled “quarterly model reviews” in the operating cadence, on the historical assumption that upstream model releases were rare events. By Q3 2025, that assumption was openly false: every major provider was shipping new versions monthly, with deprecations following 60–90 days later. Engagements on a quarterly cadence either fell behind on capabilities or were forced into emergency upgrades when the underlying model was deprecated mid-quarter.

Lesson. Model-upgrade testing is monthly, scheduled, and named on the engagement calendar. Each month, the team runs the eval suite against the candidate new model, files a written go/no-go decision, and either upgrades or documents why not. The quarterly cadence is dead.

Contract clause to encode. “The agency runs the standing eval suite against any newly released candidate model from the contracted providers within 14 days of release, and produces a written go/no-go recommendation to the client. The cadence is monthly minimum, ad-hoc on provider deprecation notices.”
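
The written go/no-go decision can be as simple as a structured record produced from two eval runs. The sketch below assumes the eval suite yields a pass rate and a cost per call for each model; `EvalResult`, `go_no_go`, the model names, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EvalResult:
    model: str
    pass_rate: float
    cost_per_call: float


def go_no_go(current: EvalResult, candidate: EvalResult,
             cost_ceiling: float, today: date) -> dict:
    """Produce a decision record: upgrade, or document why not."""
    reasons = []
    if candidate.pass_rate < current.pass_rate:
        reasons.append("pass rate regressed")
    if candidate.cost_per_call > cost_ceiling:
        reasons.append("cost ceiling breached")
    return {
        "date": today.isoformat(),
        "current": current.model,
        "candidate": candidate.model,
        "decision": "no-go" if reasons else "go",
        "reasons": reasons,
    }


current_run = EvalResult("model-a", pass_rate=0.90, cost_per_call=0.040)
candidate_run = EvalResult("model-b", pass_rate=0.93, cost_per_call=0.045)

record = go_no_go(current_run, candidate_run, cost_ceiling=0.05,
                  today=date(2025, 9, 1))
print(record)
```

A "no-go" record with its reasons is the “documents why not” half of the lesson; filing it each month is what makes the cadence auditable.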

Lesson 6: agent guardrails are infrastructure, not application code

Failure shape. Early 2025 engagements implemented guardrails (input validation, output filtering, prompt-injection defense, PII redaction) as application-layer code attached to the agent. By month 2, every agent had its own slightly different guardrail implementation, the test coverage was uneven, and a vulnerability in one agent’s guardrails was rediscovered in another’s three weeks later. The team was maintaining N agent-specific guardrail systems instead of one shared infrastructure layer.

Lesson. Guardrails are infrastructure: shared library, shared tests, shared deploy cadence, shared observability. Every agent on the platform inherits the guardrail layer. Application-specific extensions are allowed but must be additive, not replacements. This treats guardrails the way a serious backend engineer treats authentication: as a platform service, not a per-feature implementation.

Contract clause to encode. “Agents shipped under this engagement use the agency’s shared guardrail library for input validation, output filtering, prompt-injection defense, and PII handling. Application-specific extensions are documented in the agent’s ADR and reviewed against the shared library quarterly.”
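
The “additive, not replacement” rule can be enforced by construction: agents may append checks to the shared pipeline but have no way to remove one. The sketch below is a toy illustration. The substring match is not a real prompt-injection defense and the email regex is not real PII handling; `SHARED_CHECKS`, `build_guardrails`, and `redact_order_ids` are hypothetical names.

```python
import re
from typing import Callable

# A check takes text and returns (possibly redacted) text, or raises.
Check = Callable[[str], str]


class GuardrailViolation(Exception):
    """Raised when a check refuses the text outright."""


def block_injection(text: str) -> str:
    # Toy rule for illustration only.
    if "ignore previous instructions" in text.lower():
        raise GuardrailViolation("possible prompt injection")
    return text


def redact_emails(text: str) -> str:
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[EMAIL]", text)


# The shared layer every agent inherits.
SHARED_CHECKS: tuple[Check, ...] = (block_injection, redact_emails)


def build_guardrails(extensions: tuple[Check, ...] = ()) -> Check:
    """Shared checks always run first; extensions can only be appended."""
    pipeline = SHARED_CHECKS + tuple(extensions)

    def apply(text: str) -> str:
        for check in pipeline:
            text = check(text)
        return text

    return apply


def redact_order_ids(text: str) -> str:
    """Hypothetical application-specific extension for one agent."""
    return re.sub(r"\bORD-\d+\b", "[ORDER]", text)


guard = build_guardrails(extensions=(redact_order_ids,))
print(guard("mail bob@example.com about ORD-123"))
```

A fix to `block_injection` or `redact_emails` then reaches every agent in one deploy, which is the property the per-agent implementations lacked.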

Lesson 7: post-launch carries 30%+ of total project effort

Failure shape. The fixed-price engagements that priced the post-launch period at “10% of build effort” universally went over. The actual post-launch period (the first 90 days after the agent went live, covering incident response, eval drift, model upgrades, and prompt revisions in response to real traffic) consistently consumed 30–40% of the total project effort. Agencies that under-priced this phase either ate the cost or had a client conversation that damaged the relationship.

Lesson. Price post-launch as 30% minimum of total project effort, structured as a defined operating retainer with named eval drift, model upgrade, and incident response responsibilities. Treat the build phase and the operate phase as distinct commercial structures, not as “phase 1” and “warranty.”

Contract clause to encode. “Post-launch operations are a separately scoped and priced phase, consuming a minimum of 30% of total project effort across the first 90 days. Operating responsibilities (eval maintenance, model upgrade testing, incident response, prompt revision) are named individually with SLAs.”
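
Note that “10% of build effort” and “30% of total project effort” use different bases, so the gap is larger than it first looks. The sketch below converts a share of total effort into hours relative to build effort, under the simplifying assumption that total effort is build plus post-launch only; the 700-hour build figure and the function name are hypothetical.

```python
def post_launch_effort(build_effort: float, share_of_total: float) -> float:
    """Post-launch effort such that it equals `share_of_total` of the whole project.

    Assumes total = build + post_launch, so
    post_launch = build * share / (1 - share).
    """
    if not 0 <= share_of_total < 1:
        raise ValueError("share_of_total must be in [0, 1)")
    return build_effort * share_of_total / (1 - share_of_total)


BUILD_HOURS = 700.0  # hypothetical build effort

priced_the_old_way = 0.10 * BUILD_HOURS
needed_at_30pct = post_launch_effort(BUILD_HOURS, 0.30)

print(f"priced at 10% of build: {priced_the_old_way:.0f} hours")
print(f"needed at 30% of total: {needed_at_30pct:.0f} hours")
print(f"as a share of build:    {needed_at_30pct / BUILD_HOURS:.0%}")
```

Under that assumption, a 30% share of the total corresponds to roughly 43% of build effort, more than four times the “10% of build” line item.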

How the seven lessons compose

The seven lessons are not independent. Each reinforces the others, and an agency that internalizes only some of them tends to find that the un-internalized ones produce the failures that swamp the wins from the rest. A team with eval discipline (Lesson 1) but no cost cap (Lesson 2) ships a system that passes eval and dies in production. A team with traces (Lesson 3) but no human-in-loop design (Lesson 4) can debug failures it cannot prevent. A team with monthly model upgrades (Lesson 5) but per-agent guardrails (Lesson 6) chases regressions in N places instead of one. A team with all six but no post-launch reserve (Lesson 7) launches well and starves in month 3.

The composite is a particular operating shape. Eval-pre-architecture forces the spec to be written. Cost cap forces the spec to be commercially honest. Trace-first debugging forces the system to be observable. Human-in-loop as product forces the surface to be honest about uncertainty. Monthly model upgrades force the team to stay current. Shared guardrails force defenses to be uniform. Post-launch reserve forces the commercial structure to match the operational reality. This is the shape described in the AI agency operating system and lived out across the 2025 cohort.

What we wish we had done differently

Three meta-lessons sit on top of the seven structural ones, and they are worth naming.

We waited too long to publish post-mortems. The discipline of writing them happened in 2024; the discipline of publishing them, including the embarrassing ones, only took hold in mid-2025. Public post-mortems are an unreasonably effective hiring and selling tool, and we left value on the table by treating them as internal artifacts. The case for publishing is in why AI agencies should publish their post-mortems.

We should have written eval rubrics in the proposal phase. Several engagements signed contracts before either side had articulated what “good” meant, and the eval rubric got written under deadline pressure in week 1. A better pattern is to ship the eval rubric as part of the proposal, even as a sketch, so the commercial conversation is already grounded in a measurable target.

We under-priced senior reviewer time on every engagement. Senior reviewer time is the gating resource on every lesson above: the eval design, the architecture review, the cost ceiling enforcement, the trace tooling decision, the model upgrade go/no-go. Pricing it like generic engineering hours under-charged it consistently. We have begun pricing senior reviewer hours at 2–3x junior rates with a named senior on every engagement, and the math finally works.

The composite frame for 2026

The 2025 cohort taught us that production agents survive when the agency treats the operating environment as a product rather than a deliverable. The artifacts that determine survival (the eval suite, the cost cap, the traces, the human-in-loop surface, the upgrade cadence, the guardrail library, the post-launch retainer) are not features of the agent; they are the operating system around the agent. An agency that builds the operating system once and replicates it across engagements compounds. An agency that re-improvises it each time treads water and burns goodwill on the agents that fail.

The 12-agent year is also a reminder that the work is younger than it feels. None of the lessons above were obvious in January 2025; many of them feel obvious now. The next 12 agents will produce another seven, and the agencies that publish them, failures included, will be the ones whose proposals close on the basis of trust rather than logo decks. We are betting on that side.


Arthur Wandzel is the founder of SFAI Labs. This retrospective is illustrative of the patterns observable across forward-deployed AI agency work in 2025; specific clients, dollar amounts, and incident details have been intentionally omitted in favor of structural lessons.

Frequently Asked Questions

What is the most important lesson from shipping production AI agents in 2025?

Write the eval suite before the architecture. Engagements that designed the architecture first and added evals later universally produced systems whose architecture was wrong for the eval that emerged once one was finally written down. The eval is the spec; the architecture is the implementation. Reversing the order produces an implementation of an unspecified spec, which is the technical definition of guesswork. Encode this as a contract clause requiring the eval rubric and at least 20 ground-truth cases to be delivered no later than day 5.

Why is cost-per-call a product feature rather than an infrastructure detail?

Because cost-per-call routinely binds tighter than accuracy or latency on the unit economics. Agents that worked technically but ran 3 to 8 times the cost the business model could absorb were discovered to be unviable only when the customer’s CFO did the math. The fix is to name the cost-per-call ceiling in the engagement charter alongside the accuracy threshold, and to evaluate every architectural decision against that ceiling. A model upgrade that improves accuracy 4% and increases cost 60% is a regression until the unit economics are renegotiated.

What does trace-first debugging mean for production agents?

It means every production agent ships with structured trace instrumentation from day 1: tool calls, retrieved context, model decisions, token counts, latencies, and costs, all attached to a request ID and queryable in under five minutes. Adding traces after the first incident is too late, because the team is then debugging the previous month’s traces with this month’s tooling. Trace-first debugging is not a ‘good engineering practice’; it is the only way to honor a same-day SLA on incident triage.

When should human-in-the-loop be designed into an AI agent?

From day 1, if the deployment surface includes any human review at any rate above 0% of requests. Agents designed for full autonomy and retrofitted with human review universally required a partial rewrite, because the escalation paths, audit trails, UI surface, and the agent’s own confidence signals were all wrong for the new requirement. Human-in-loop is a product surface, not a fallback; treating it as a phase-2 addition is the same anti-pattern as ‘we’ll add tests later.’

How often should an AI agency test new model versions against an existing agent?

Monthly minimum, with ad-hoc runs whenever a provider issues a deprecation notice. By Q3 2025, every major provider was shipping new model versions monthly with deprecations following 60 to 90 days later. The historical ‘quarterly model review’ cadence is dead. The standing pattern is: each month the team runs the eval suite against any candidate new model, files a written go/no-go decision, and either upgrades or documents why not. The cadence is named in the operating contract, not left to discretion.

Should agent guardrails be implemented per-agent or as shared infrastructure?

As shared infrastructure. Per-agent guardrails (input validation, output filtering, prompt-injection defense, and PII redaction implemented at the application layer) produce uneven coverage and rediscover vulnerabilities across agents. The fix is to treat guardrails the way a serious backend engineer treats authentication: a shared library with shared tests, deploy cadence, and observability. Every agent on the platform inherits the guardrail layer; application-specific extensions are additive and reviewed against the shared library.

How much of total project effort should be reserved for post-launch operations?

At least 30%, structured as a defined operating retainer with named eval drift, model upgrade, and incident response responsibilities. Engagements that priced post-launch at ‘10% of build effort’ uniformly went over: actual post-launch consumed 30 to 40% of total project effort across the first 90 days. The fix is to treat the build phase and the operate phase as distinct commercial structures with separately scoped pricing and named SLAs, not as ‘phase 1’ and ‘warranty’.

What contract clause encodes eval-pre-architecture discipline?

A clause that names the eval rubric and baseline suite as the first deliverable, due no later than day 5 of the engagement. Architectural decisions made before this deliverable are advisory; architectural decisions after it are binding. The clause separates the spec-writing phase from the architecture-writing phase, prevents the team from committing to architecture before the objective is written, and gives the client a contractual hook to reject premature architectural commitments.

Why publish post-mortems publicly instead of keeping them internal?

Because public post-mortems are unreasonably effective hiring and selling tools. The act of publishing, including the embarrassing failures, signals that the agency operates a real engineering culture rather than a marketing function. Agencies that published in 2025 reported faster senior hiring, faster proposal-to-close cycles with sophisticated buyers, and a defensible position when competitors made claims they could not substantiate. Internal-only post-mortems leave that value on the table.

What single change would have the biggest impact on production agent reliability for most agencies?

Adopt eval-pre-architecture as a hard rule. Most other lessons (cost caps, traces, human-in-loop, model upgrades, guardrails, post-launch retainer) are easier to install once an eval suite exists, because the eval suite anchors the conversation about what ‘good’ means. Without an eval, every other discipline is negotiated against vibes. With an eval, the discipline cascades. If an agency could only adopt one of the seven lessons, eval-first is the highest-leverage choice.

Last Updated: Jun 4, 2026

Arthur Wandzel
