Enterprise Software 13 min read

The AI Agency Quality System: Evals, Observability, and Weekly Review

A 2026 AI agency does not have a “QA process.” It runs a quality system: three layered components (evals, observability, and a weekly review ritual) that together convert a non-deterministic model into a system a CFO can sign off on. Take any one out and the others collapse. Evals without observability give you a green CI build over a feature degrading silently in production. Observability without evals gives you traces nobody can read. Both without the ritual give you dashboards that go stale by week three.

This is the quality surface of the AI agency manifesto: what commitments two (“evals are the contract”) and ten (“persistence is the moat”) look like as a working operating system.

Why three layers, not one

The intuitive failure mode is picking the layer that feels most natural and treating it as the whole system. Engineering-led agencies over-index on evals: Promptfoo in the repo, CI thresholds, quality declared solved. Operations-led teams over-index on observability: dashboards called a quality function. Strategy shops over-index on the meeting: a weekly review with no underlying data, devolving into vibes by month two.

Each failure mode ships features that look fine on launch day and degrade silently. Evals cannot tell you what real users hit at 11pm Tuesday. Observability tells you what is happening only if you know which traces matter, and that knowledge does not come from a dashboard. The weekly review is where eval delta meets observability signal and becomes a decision: ship the upgrade, roll back the prompt, raise the threshold.

Layer 1: Evals, the contract you can run

An eval suite is a fixed set of inputs paired with expected behaviors and thresholds, checked into version control, runnable on every commit. It is the only mechanism that detects regression when the model is silently updated, the prompt is edited, or the retrieval index is rebuilt. Anthropic’s model release notes and OpenAI’s GPT release log make clear that minor version bumps change behavior enough to flip outputs on edge cases.

A useful suite has three properties most teams skip: versioned with the code that calls the model, runnable on a laptop in under five minutes, and tied to a documented threshold (“≥85% pass on the regulatory set, ≥0.92 cosine similarity on retrieval, p95 latency under 4s”). A pass rate without a threshold is a vanity metric.

What to test, by feature class

  • Classification and extraction. Labeled gold sets with precision, recall, F1 per class. Track confusion matrices so a per-class regression does not hide inside an aggregate.
  • Retrieval-augmented generation. Test retrieval and generation separately. Retrieval gets recall@k and precision@k; generation gets faithfulness and answer relevance via LLM-as-judge with a documented rubric. Ragas and TruLens ship these primitives. See retrieval optimization for RAG systems.
  • Agent and tool-use systems. Test on traces, not just outputs: “did the agent call the right tools in the right order with the right arguments.” LangSmith and OpenAI’s evals SDK support trace-level assertions.
  • Generative content. Paired LLM-as-judge with a calibrated rubric and a small human-rated holdout. Single-judge grading drifts; pairing it with scored examples reduces drift.
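As a minimal sketch of the retrieval half of a RAG eval, recall@k and precision@k can be computed from a ranked list of retrieved document IDs and a gold set of relevant IDs. The function names and sample IDs below are illustrative assumptions; they are not the Ragas or TruLens API.

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled by a relevant document.

    Divides by k even if fewer than k documents were retrieved,
    which is one common convention.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the gold relevant set that appears in the top-k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


# Hypothetical example: five retrieved chunks, three gold chunks.
retrieved_ids = ["d7", "d2", "d9", "d4", "d1"]
gold_ids = {"d2", "d4", "d8"}

p_at_3 = precision_at_k(retrieved_ids, gold_ids, k=3)  # 1 hit in 3 slots
r_at_3 = recall_at_k(retrieved_ids, gold_ids, k=3)     # 1 of 3 gold chunks found
r_at_5 = recall_at_k(retrieved_ids, gold_ids, k=5)     # 2 of 3 gold chunks found
```

Scoring retrieval on its own like this makes it possible to tell a retrieval regression apart from a generation regression.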

Threshold types

Three honest threshold types:

  1. Hard pass/fail. “≥95% on the safety set or the build does not deploy.” For regulated domains and SLA contracts.
  2. Trend. “Pass rate must not drop more than 2 points week-over-week.” For soft-quality features where regressions are detectable but absolutes are subjective.
  3. Cohort-stratified. “≥85% overall, and ≥80% on every customer cohort.” Prevents the aggregate from staying green while one customer silently regresses.
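The three threshold types can be sketched as three small predicates. This is an illustration only: the function names, the percentage-point units, and the cohort names are assumptions, and the cohort floor is assumed to apply to each cohort individually.

```python
from typing import Dict


def hard_gate(pass_rate: float, minimum: float) -> bool:
    """Hard pass/fail: the pass rate must meet an absolute floor."""
    return pass_rate >= minimum


def trend_gate(current: float, previous: float, max_drop_points: float) -> bool:
    """Trend: the pass rate may not drop more than the allowed number of points."""
    return (previous - current) <= max_drop_points


def cohort_gate(overall: float, by_cohort: Dict[str, float],
                overall_min: float, cohort_min: float) -> bool:
    """Cohort-stratified: an overall floor plus a floor for each cohort."""
    return overall >= overall_min and all(
        rate >= cohort_min for rate in by_cohort.values()
    )


# Hypothetical pass rates, in percent.
safety_ok = hard_gate(96.0, minimum=95.0)
trend_ok = trend_gate(current=88.0, previous=91.0, max_drop_points=2.0)
cohorts_ok = cohort_gate(
    overall=87.0,
    by_cohort={"acme": 90.0, "globex": 78.0},  # hypothetical customers
    overall_min=85.0,
    cohort_min=80.0,
)
```

In this example the aggregate of 87% clears its floor, yet the cohort gate still fails because one customer sits at 78%.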

Negotiate the threshold with the buyer before launch and write it into the SOW. On the contracting side: stop scoping AI projects in features and scope them in evaluations.

CI gating, named tools

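The field has converged on a small set of tools. Those named in this guide include Promptfoo, Ragas, TruLens, LangSmith, and OpenAI’s evals SDK.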

Tool choice is negotiable; presence is not. CI must fail the build when a threshold is missed, and the failure must be readable by an on-call engineer who did not write the test.
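A minimal, tool-agnostic sketch of such a gate follows. It assumes eval results have already been computed; the `ThresholdResult` type, the set names, and the justification strings are hypothetical. The point is that each failure line carries the observed value, the required value, and the reason for the bar, and that the process exit code is non-zero when any threshold is missed.

```python
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class ThresholdResult:
    name: str           # hypothetical test-set name
    observed: float
    minimum: float
    justification: str  # one-line reason or link, so on-call can read the failure


def failure_report(results: List[ThresholdResult]) -> List[str]:
    """Return one readable line per missed threshold."""
    return [
        f"FAIL {r.name}: observed {r.observed:.2f} < required {r.minimum:.2f} "
        f"({r.justification})"
        for r in results
        if r.observed < r.minimum
    ]


def gate(results: List[ThresholdResult]) -> int:
    """Print failures to stderr and return an exit code: 0 if all pass, else 1."""
    lines = failure_report(results)
    for line in lines:
        print(line, file=sys.stderr)
    return 1 if lines else 0


# Hypothetical results; a real suite would compute these from eval runs.
sample = [
    ThresholdResult("regulatory_set", 0.83, 0.85, "hypothetical: floor agreed in SOW"),
    ThresholdResult("retrieval_similarity", 0.94, 0.92, "hypothetical: retrieval baseline"),
]
exit_code = gate(sample)
# In a CI step, the script would end with: sys.exit(exit_code)
```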

What good looks like. A new engineer runs the suite locally in week one, regresses the prompt by one sentence, sees the suite go red on the right test. Every threshold has a one-line justification linked to a Slack thread, ticket, or regulatory document. Model upgrades are run against the suite before approval.

What failure looks like. A single Jupyter notebook last edited four months ago. Thresholds unset or at 50%. The CI step that “runs evals” runs three smoke tests and exits zero. The pitch deck mentions evals; the repo has no evals/ directory.

Layer 2: Observability, the production truth

Evals tell you whether the system passes a fixed bar on inputs you chose. Observability tells you what it is doing on inputs you did not choose. A 2026 system has four observability primitives running by launch:

  1. Traces: every model call captured with prompt, response, model, parameters, latency, token counts, parent span, and a correlation ID back to the user request.
  2. A prompt registry: every production prompt versioned, tagged with a deployment, queryable by name. No prompts in source files without a registry entry.
  3. Cost telemetry: token counts and dollar costs aggregated by feature, customer, and model, with anomaly alerts.
  4. Error budgets: explicit rates for “refusal,” “malformed JSON,” “tool call failed,” and “retrieval empty,” each with a budget the system can consume before triggering review.
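To make the first primitive concrete, here is a minimal sketch of a trace record and a wrapper that fills it in. The field names, the `traced_call` helper, and the `fake_model` stand-in are assumptions for illustration; hosted tracing tools define their own schemas.

```python
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Callable, Dict, Optional, Tuple


@dataclass
class ModelCallTrace:
    correlation_id: str          # ties the call back to the user request
    prompt_name: str
    prompt_version: int
    model: str
    parameters: Dict[str, Any]
    prompt: str
    response: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    parent_span_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def traced_call(call_model: Callable[..., Tuple[str, int, int]], *,
                correlation_id: str, prompt_name: str, prompt_version: int,
                model: str, parameters: Dict[str, Any], prompt: str,
                parent_span_id: Optional[str] = None) -> ModelCallTrace:
    """Invoke the model callable and capture everything needed to replay the call."""
    start = time.perf_counter()
    response, input_tokens, output_tokens = call_model(prompt, model=model, **parameters)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return ModelCallTrace(correlation_id, prompt_name, prompt_version, model,
                          parameters, prompt, response, latency_ms,
                          input_tokens, output_tokens, parent_span_id)


def fake_model(prompt: str, model: str, temperature: float = 0.0) -> Tuple[str, int, int]:
    """Stand-in for a provider SDK call; returns (text, input_tokens, output_tokens)."""
    return "ok", len(prompt.split()), 1


trace = traced_call(fake_model, correlation_id="req-123", prompt_name="ticket_summary",
                    prompt_version=3, model="example-model",
                    parameters={"temperature": 0.2}, prompt="Summarize this ticket")
record = asdict(trace)  # plain dict, ready to ship to whatever trace store is in use
```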

Traces, named tools

Prompt registry, cost telemetry, error budgets

A prompt registry decouples prompt iteration from code deploys. Without one, prompts are either edited carelessly in production or never edited at all. LangSmith Prompts, Langfuse Prompts, Promptlayer, and Helicone all ship registries; the minimum-viable in-house version is a Postgres table with a version column and deployed-at timestamp.
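A rough sketch of that minimum-viable registry follows. It uses Python's built-in `sqlite3` as a stand-in for Postgres so the example runs anywhere; the table layout, column names, and helper functions are assumptions, not a prescribed schema.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional, Tuple

SCHEMA = """
CREATE TABLE prompts (
    name        TEXT    NOT NULL,
    version     INTEGER NOT NULL,
    body        TEXT    NOT NULL,
    deployed_at TEXT,              -- NULL until this version is deployed
    PRIMARY KEY (name, version)
)
"""


def add_version(conn: sqlite3.Connection, name: str, body: str) -> int:
    """Store a new version of a named prompt and return its version number."""
    row = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM prompts WHERE name = ?", (name,)
    ).fetchone()
    version = row[0] + 1
    conn.execute(
        "INSERT INTO prompts (name, version, body) VALUES (?, ?, ?)",
        (name, version, body),
    )
    return version


def deploy(conn: sqlite3.Connection, name: str, version: int,
           deployed_at: Optional[str] = None) -> None:
    """Mark a version as deployed; the timestamp defaults to now (UTC, ISO 8601)."""
    stamp = deployed_at or datetime.now(timezone.utc).isoformat(timespec="microseconds")
    conn.execute(
        "UPDATE prompts SET deployed_at = ? WHERE name = ? AND version = ?",
        (stamp, name, version),
    )


def current_prompt(conn: sqlite3.Connection, name: str) -> Optional[Tuple[int, str]]:
    """Return (version, body) of the most recently deployed version, if any."""
    return conn.execute(
        "SELECT version, body FROM prompts "
        "WHERE name = ? AND deployed_at IS NOT NULL "
        "ORDER BY deployed_at DESC, version DESC LIMIT 1",
        (name,),
    ).fetchone()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
v1 = add_version(conn, "ticket_summary", "Summarize the ticket in two sentences.")
v2 = add_version(conn, "ticket_summary", "Summarize the ticket in one sentence.")
deploy(conn, "ticket_summary", v1, deployed_at="2026-01-05T09:00:00+00:00")
deploy(conn, "ticket_summary", v2, deployed_at="2026-01-12T09:00:00+00:00")
live_after_upgrade = current_prompt(conn, "ticket_summary")
# Rolling back is a registry write, not a code deploy.
deploy(conn, "ticket_summary", v1, deployed_at="2026-01-13T09:00:00+00:00")
live_after_rollback = current_prompt(conn, "ticket_summary")
```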

Cost telemetry is a daily job that joins traces to billing exports, breaks cost down by (feature, customer, model, prompt_version), and alerts when any cell exceeds budget. Broader cost discipline is covered in AI monitoring and observability for LLM performance.
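The aggregation step of such a job might look like the sketch below. The prices, trace fields, and budget values are made up for illustration; a real job would take prices from the provider's billing export rather than a hard-coded table.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple

Cell = Tuple[str, str, str, int]  # (feature, customer, model, prompt_version)

# Hypothetical USD prices per 1K tokens, not real provider pricing.
PRICE_PER_1K = {"example-model": {"input": 0.003, "output": 0.015}}


def cost_by_cell(traces: Iterable[Dict[str, Any]]) -> Dict[Cell, float]:
    """Sum dollar cost per (feature, customer, model, prompt_version) cell."""
    totals: Dict[Cell, float] = defaultdict(float)
    for t in traces:
        price = PRICE_PER_1K[t["model"]]
        cost = (t["input_tokens"] / 1000 * price["input"]
                + t["output_tokens"] / 1000 * price["output"])
        cell = (t["feature"], t["customer"], t["model"], t["prompt_version"])
        totals[cell] += cost
    return dict(totals)


def over_budget(costs: Dict[Cell, float], budgets: Dict[Cell, float],
                default_budget: float) -> List[Cell]:
    """Return the cells whose cost exceeds their budget (or the default budget)."""
    return sorted(
        cell for cell, cost in costs.items()
        if cost > budgets.get(cell, default_budget)
    )


# Hypothetical day of traffic.
day_traces = [
    {"feature": "summarize", "customer": "acme", "model": "example-model",
     "prompt_version": 3, "input_tokens": 1000, "output_tokens": 1000},
    {"feature": "summarize", "customer": "acme", "model": "example-model",
     "prompt_version": 3, "input_tokens": 1000, "output_tokens": 1000},
    {"feature": "classify", "customer": "globex", "model": "example-model",
     "prompt_version": 1, "input_tokens": 2000, "output_tokens": 0},
]
costs = cost_by_cell(day_traces)
alerts = over_budget(costs, budgets={}, default_budget=0.02)
```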

Error budgets are a contract with reality: “this system returns malformed responses 0.5% of the time, and we will not panic until it crosses 1% sustained over 24 hours.” Without one, every anomaly is a fire drill and the team learns to ignore the dashboard.
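One way to encode that rule is sketched below. It reads “sustained over 24 hours” as “every hourly bucket in the last 24 exceeds the threshold,” which is an assumption; a team could equally use the aggregate rate over a rolling window. The bucket format and function name are illustrative.

```python
from typing import List, Tuple

HourBucket = Tuple[int, int]  # (malformed_responses, total_responses) for one hour


def budget_breached(buckets: List[HourBucket], threshold: float = 0.01,
                    window_hours: int = 24) -> bool:
    """True if each of the most recent `window_hours` buckets exceeds the threshold.

    An hour with no traffic counts as not exceeding, so it breaks the streak.
    """
    if len(buckets) < window_hours:
        return False  # not enough history to call the breach sustained
    recent = buckets[-window_hours:]
    return all(total > 0 and errors / total > threshold for errors, total in recent)


# Hypothetical hourly counts.
steady = [(5, 1000)] * 24                        # 0.5% every hour: inside budget
one_spike = [(5, 1000)] * 23 + [(30, 1000)]      # a single 3% hour: not sustained
sustained = [(12, 1000)] * 24                    # 1.2% for 24 hours: review triggered

steady_breach = budget_breached(steady)
spike_breach = budget_breached(one_spike)
sustained_breach = budget_breached(sustained)
```

Under this reading a one-hour spike is logged and watched, and only the sustained case triggers review.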

What good looks like. One-click dashboard: 24 hours of traffic by feature, cost per feature, p50/p95/p99 latency, refusal rate, JSON-parse failures, top five most expensive requests. On-call navigates from a support ticket to the exact trace in under two minutes.

What failure looks like. A Datadog dashboard with three line charts and no AI-specific signal. No prompt registry; prompts in a Notion page nobody updates. Cost telemetry is “we’ll export the OpenAI invoice at month-end.” When a regression is reported, the response is “we don’t have logs; can you reproduce in staging?”

Layer 3: The weekly review ritual

Evals and observability produce data. Data does not produce decisions. The weekly review is the standing meeting where engineering lead, product owner, and a senior buyer-side stakeholder convert the previous week’s eval deltas, anomalies, and customer feedback into shipped changes. Without it, dashboards age, the suite stops being run, and quality decays into the post-launch slow death that produces churn.

The 60-minute agenda

A fixed agenda outperforms a flexible one:

  1. Eval delta review (15 min). What changed week-over-week? Threshold breached? Test newly added? Cohort regressed? One-page diff posted before the meeting.
  2. Regression triage (15 min). Walk through observability anomalies: refusal spikes, cost anomalies, latency regressions. Each gets one of three dispositions: ignore (with a written reason), monitor (with a budget), or fix (with an owner and date).
  3. Model upgrade test (15 min). Any provider release this week? If yes, the suite is run against it before the meeting. Decision: “upgrade now,” “upgrade after one more week of monitoring,” or “do not upgrade and document why.” Provider prompt deprecations and policy changes surface here too; Anthropic and OpenAI both publish these and most teams miss them.
  4. Customer signal (10 min). Three to five real customer interactions surfaced from observability or support, read out loud. Pattern-matching customer language against eval coverage is how new tests get added.
  5. Decisions (5 min). Writeup posted within 24 hours: every threshold change, new test, model decision, and action item with an owner.
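The one-page diff for the eval delta review can be generated rather than written by hand. The sketch below assumes pass rates are stored per test set as percentages; the set names, thresholds, and output format are illustrative.

```python
from typing import Dict, List


def eval_delta(previous: Dict[str, float], current: Dict[str, float],
               thresholds: Dict[str, float]) -> List[str]:
    """Build a numbered week-over-week diff of pass rates per test set."""
    lines = []
    for name in sorted(current):
        now = current[name]
        if name not in previous:
            lines.append(f"{name}: NEW at {now:.1f}%")
            continue
        delta = now - previous[name]
        status = "BREACH" if now < thresholds.get(name, 0.0) else "ok"
        lines.append(
            f"{name}: {previous[name]:.1f}% -> {now:.1f}% ({delta:+.1f}) [{status}]"
        )
    for name in sorted(set(previous) - set(current)):
        lines.append(f"{name}: REMOVED")
    return [f"{i}. {line}" for i, line in enumerate(lines, start=1)]


# Hypothetical pass rates for two consecutive weeks.
last_week = {"safety": 97.0, "regulatory": 88.0}
this_week = {"safety": 96.5, "regulatory": 83.0, "billing_disputes": 90.0}
floors = {"safety": 95.0, "regulatory": 85.0}

diff = eval_delta(last_week, this_week, floors)
# diff[0] is "1. billing_disputes: NEW at 90.0%"
# diff[1] is "2. regulatory: 88.0% -> 83.0% (-5.0) [BREACH]"
# diff[2] is "3. safety: 97.0% -> 96.5% (-0.5) [ok]"
```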

The broader pattern of agency rituals is in decoding the AI agency stack: roles, rituals and review cadences that work.

What good looks like. Meeting starts with a numbered diff. The buyer-side stakeholder leaves with three things they did not know an hour earlier. Six weeks in, the suite has grown by 30–50 tests directly traceable to customer signals. When a new model ships, the team has a decision before the buyer reads about it on Twitter.

What failure looks like. The meeting is on the calendar but cancelled three weeks out of four. Agenda is “AI sync” with no items. The agency runs a deck of green checkmarks and the buyer stops paying attention by minute fifteen. No writeup. No new evals since launch. Asked about the latest model release: “we’ll look at that next week.”

How to audit your agency in six artifacts

The layers wire into each other. A trace surfaced in the weekly review becomes a new eval test by Monday. A threshold breach in CI becomes an observability query during the next regression. A cost anomaly becomes an eval asserting “this feature must run under $X per call.” The connective tissue is the weekly ritual; without it, eval suite and observability platform sit in adjacent tools and rarely speak.

Detect whether the connective tissue is real with one question: “Show me a test added in the last six weeks because of something you saw in production.” A team running the system answers in thirty seconds and walks you through the trace, the test, and the threshold. A team performing the system gets back to you tomorrow.

Before the next vetting call or renewal, ask for these six artifacts:

  1. Path in the repo to the eval suite, plus the threshold for each test and the date last edited.
  2. A read-only dashboard URL for the production observability stack with at least one week of data.
  3. The prompt registry link and version history of the highest-traffic prompt.
  4. Cost-per-feature breakdown for the last 30 days, broken down by (feature, customer, model).
  5. The last four weekly review writeups.
  6. The list of evals added in the last 60 days and the customer signal that triggered each one.

There are good reasons an early-stage engagement might be missing one or two; there are none for missing three or more.

Frequently asked questions

How long does it take to stand up the full quality system?

Two to four weeks for a feature in production: one week for trace instrumentation and eval scaffolding, one week for the first 30–50 evals against a representative sample, one to two weeks to wire the weekly review and tune thresholds. For features in active development, build alongside; retrofitting is usually more expensive.

Can a small team run all three layers?

Yes. The layers are mostly tooling, not headcount. Promptfoo plus Langfuse plus a 60-minute Friday meeting fits inside a single engineer’s week. The cost is discipline, not bodies.

What is the minimum eval suite worth shipping?

Twenty tests covering the three or four highest-stakes flows, each with a documented threshold. Fewer is not running evals; hundreds before launch confuses coverage with value. Coverage grows from real customer signal.

Should evals run on every commit or on a schedule?

Both. Fast subset (under five minutes, deterministic) on every commit. Full suite (LLM-as-judge, slow tests, expensive calls) nightly and on release branches. Releasing without the gating subset requires an explicit override with written justification.

Who owns the weekly review on the buyer side?

Whoever has authority to approve a model upgrade and a threshold change. Usually the product owner, sometimes the engineering lead, occasionally a CTO. Rarely a project manager whose only role is notes. The owner must be senior enough that “do not upgrade this model” is a decision they can make without a follow-up meeting.

How does this handle a new model that is materially better?

The model upgrade slot exists for this. Run the candidate against the suite; cost telemetry tells you whether per-call cost moves your budget; observability catches drift after deployment. A model that passes, fits the budget, and survives a one-week monitored rollout is green-light. Without the system, upgrades become either a fire drill or a silent skip.
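The green-light rule can be written down as a small decision function. This is one possible reading, with assumed names and an assumed seven-day rollout requirement; the returned labels are illustrative, not fixed terminology.

```python
def upgrade_decision(passes_suite: bool, fits_budget: bool,
                     clean_rollout_days: int, required_days: int = 7) -> str:
    """Map the three checks to a decision for the weekly review."""
    if not (passes_suite and fits_budget):
        return "do not upgrade and document why"
    if clean_rollout_days >= required_days:
        return "green-light"
    return "continue monitored rollout"


# Hypothetical candidates.
ready = upgrade_decision(True, True, clean_rollout_days=7)
still_rolling = upgrade_decision(True, True, clean_rollout_days=3)
too_expensive = upgrade_decision(True, False, clean_rollout_days=7)
```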

What is the most expensive failure mode of skipping the system?

A silent regression that persists for weeks because the suite was not run on the prompt change, observability was not granular enough, and the weekly review was cancelled. Cost: immediate customer impact plus the institutional belief that “AI features are flaky,” which becomes the rationale for pausing AI investment.

Is observability tooling worth it for small features?

Yes. Helicone and Langfuse have generous free tiers; instrumentation is a half-day for a feature with a handful of model calls. The asymmetry is severe: minimal cost versus a large upside.

Key takeaways

  • A 2026 AI agency runs three quality layers as one system: evals, observability, and a weekly review. Take any one out and the others stop working.
  • Evals are versioned with the code, runnable in under five minutes, tied to thresholds the buyer agreed to before launch.
  • Observability is traces, prompt registry, cost telemetry, and written-down error budgets, not a generic infra dashboard.
  • The weekly review converts eval deltas and anomalies into shipped changes. Without it, dashboards go stale by week three.
  • Audit in six artifacts: eval path, observability dashboard, prompt registry, cost-per-feature breakdown, last four writeups, recent evals from production signal.
  • Two engineers and a 60-minute meeting, not a separate quality department.

Last Updated: May 25, 2026

Arthur Wandzel

SFAI Labs helps companies build AI-powered products that work. We focus on practical solutions, not hype.
