The AI Agency Reference Architecture for Agent-Heavy Engagements

A studio that ships agent-heavy systems without a named, defended reference architecture is reinventing six layers of plumbing on every engagement. Agent loop, tool registry, memory, eval suite, guardrails, trace-first observability: each has a defensible default and two or three credible alternatives, and re-debating them per project is the difference between shipping in week eight and getting cancelled in week sixteen.

This is the agent-specific spoke of the AI agency reference architecture for tools, templates, and rituals. It is the companion to the RAG-heavy reference architecture. Where the RAG version names a chunking strategy and a reranker, this one names the loop pattern, the tool runtime, and the trajectory-eval gate.

Decision Scope

This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.

Why agents need their own reference architecture

The general reference architecture names engagement-wide defaults: model router, eval framework, observability, deploy platform, internal packages. It does not name the agent loop, the tool registry shape, or how trajectories are evaluated. Those decisions sit inside the studio’s agent-starter package and determine whether the agent does the right thing reliably enough to put in front of a user. A studio treating every engagement as greenfield re-debates them in week one, ships a demo in week four, and discovers in week eight that the loop was wrong, the tool registry has thirty entries the model cannot disambiguate, and there are no trajectory evals to gate the next deploy.

The deeper engineering case is made in two companion pieces: the OpenClaw AI agent framework, and “Stop scoping AI projects in features; scope them in evaluations.” The summary:

Layer	Default	Deviation triggers
Agent loop	ReAct (single-agent, ≤5 tools)	Plan-execute for long-horizon. Hierarchical only for genuine multi-domain.
Tool registry	Typed JSON Schema, in-process function calls; MCP for cross-process	Voice/realtime → streaming-first runtime.
Memory	Short-context only	Episodic per-user via ADR. Semantic → almost always RAG, not memory.
Eval suite	Trajectory + tool-call + end-state on Inspect; Promptfoo for prompts	Voice/realtime → barge-in and partial-output scoring.
Guardrails	Input filter, output verifier, kill switch, human-in-the-loop on irreversibles	Air-gapped → self-hosted classifier; otherwise no deviation.
Observability	Langfuse + OpenTelemetry GenAI; trace-first replay	Self-hosted Langfuse for regulated.

Layer 1: The agent loop

The default is ReAct (reason, act, observe) for single-agent engagements with five or fewer tools and a horizon the model can hold in working memory. ReAct debugs cleanly under a trace, and its failure modes are named (loop-on-tool, premature-termination, over-tooling). LangGraph and the OpenAI Agents SDK both ship a ReAct primitive; the studio uses LangGraph for graph-shaped control flow and the Agents SDK for single-provider engagements. Anthropic SDK + tools is the right primitive when the engagement is Anthropic-only and the tool calls need the model’s strongest reasoning behavior.

Two named alternatives, each chosen via ADR:

Plan-execute for long-horizon tasks where the agent needs an explicit, revisable plan: multi-step research, code generation across files, structured extraction over a hundred-page document. The plan is a first-class artifact the eval suite can score independently of the execution trace.
Hierarchical multi-agent (a planner-orchestrator routing to specialist sub-agents), only for engagements that genuinely span multiple domains. The trap is choosing this on day one because it sounds sophisticated; the cost is an eval suite that scores routing, sub-agent trajectories, and inter-agent handoffs separately.

The choice is recorded in an ADR with a 50-trajectory benchmark on the buyer’s task before the loop is locked.
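
The reason-act-observe cycle described above can be sketched in a few lines. This is a minimal illustration, not the API of LangGraph or the OpenAI Agents SDK: `Step`, `run_react`, `max_steps`, and the scripted stand-in model are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Step:
    """One model decision: call a tool, or finish with an answer."""
    thought: str
    tool: Optional[str] = None
    args: dict = field(default_factory=dict)
    answer: Optional[str] = None


def run_react(model: Callable[[list], Step], tools: dict, task: str, max_steps: int = 8):
    """Reason -> act -> observe until the model answers or the step budget runs out."""
    history = [("task", task)]
    for _ in range(max_steps):
        step = model(history)                              # reason
        history.append(("thought", step.thought))
        if step.answer is not None:                        # model decided it is done
            return step.answer, history
        if step.tool not in tools:
            # Unknown tool: surface it as an observation instead of crashing
            observation = f"error: unknown tool {step.tool!r}"
        else:
            observation = tools[step.tool](**step.args)    # act
        history.append(("observation", observation))       # observe
    return None, history                                   # budget exhausted, no answer


# Scripted stand-in for a real model, so the loop runs offline (hypothetical).
def scripted_model(history):
    if not any(kind == "observation" for kind, _ in history):
        return Step(thought="I need the order status.", tool="get_order", args={"order_id": "A1"})
    return Step(thought="I have what I need.", answer=f"Status: {history[-1][1]}")


answer, trace = run_react(
    scripted_model,
    {"get_order": lambda order_id: "shipped"},
    "Where is order A1?",
)
```

The returned `history` doubles as a trajectory: it is the sequence a 50-trajectory benchmark would score, and a `None` answer after `max_steps` is how a loop-on-tool failure would show up in this sketch.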

Layer 2: The tool registry

A tool registry is not a list of functions. It is a typed, validated, version-controlled contract surface the agent and runtime both consume. Three rules.

Every tool has a typed JSON Schema for inputs and outputs, validated on both sides; a schema failure is a tool-call error in the trace, not a silent coercion. The Anthropic and OpenAI Agents SDKs both enforce input schemas; the studio default extends this to outputs, because an unvalidated tool result is the most common source of downstream agent confusion.

In-process tools are typed function calls; cross-process tools live behind MCP servers. The Model Context Protocol (Anthropic’s open spec, now broadly adopted) is the studio default for any tool reused across agents, runtimes, or providers. An MCP-wrapped tool is written once and consumed by an OpenAI, Anthropic, or LangGraph agent without reimplementation.

The registry caps tool count per agent at twelve to fifteen. Beyond that, tool-selection accuracy collapses on every benchmark we run. The right pattern above the cap is hierarchical decomposition or retrieval over the tool catalog.

Descriptions are treated as prompts and version-controlled; a sloppy description is the most common preventable agent failure.
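
As a rough sketch of the rules above for in-process tools: the registry below validates inputs and outputs and enforces the per-agent cap. It assumes a deliberately tiny subset of JSON Schema (`required` plus flat `properties` types) checked by a hand-written `check` function; a real registry would use a full JSON Schema validator. `Tool`, `ToolRegistry`, `ToolCallError`, and `MAX_TOOLS_PER_AGENT` are illustrative names.

```python
from dataclasses import dataclass
from typing import Callable

MAX_TOOLS_PER_AGENT = 15  # upper end of the twelve-to-fifteen cap

_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool, "object": dict}


class ToolCallError(Exception):
    """Raised on schema failure so it surfaces as a tool-call error, not a coercion."""


def _matches(value, type_name: str) -> bool:
    if isinstance(value, bool):              # bool is a subclass of int in Python
        return type_name == "boolean"
    return isinstance(value, _TYPES[type_name])


def check(schema: dict, value: dict, where: str) -> None:
    """Validate a flat object against a small subset of JSON Schema."""
    for name in schema.get("required", []):
        if name not in value:
            raise ToolCallError(f"{where}: missing required field {name!r}")
    for name, spec in schema.get("properties", {}).items():
        if name in value and not _matches(value[name], spec["type"]):
            raise ToolCallError(f"{where}: field {name!r} is not {spec['type']}")


@dataclass(frozen=True)
class Tool:
    name: str
    version: str
    description: str        # treated as a prompt, so it lives in version control
    input_schema: dict
    output_schema: dict
    fn: Callable[..., dict]


class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, tool: Tool) -> None:
        if len(self._tools) >= MAX_TOOLS_PER_AGENT:
            raise ValueError("tool cap reached; decompose or retrieve over the catalog")
        self._tools[tool.name] = tool

    def call(self, name: str, args: dict) -> dict:
        tool = self._tools[name]
        check(tool.input_schema, args, f"{name} input")      # validate on the way in
        result = tool.fn(**args)
        check(tool.output_schema, result, f"{name} output")  # and on the way out
        return result


registry = ToolRegistry()
registry.register(Tool(
    name="get_order",
    version="1.0.0",
    description="Look up one order by its ID. Returns the current shipping status.",
    input_schema={"type": "object", "required": ["order_id"],
                  "properties": {"order_id": {"type": "string"}}},
    output_schema={"type": "object", "required": ["status"],
                   "properties": {"status": {"type": "string"}}},
    fn=lambda order_id: {"status": "shipped"},
))
```

Calling `registry.call("get_order", {"order_id": 7})` raises `ToolCallError` rather than coercing the integer to a string, which is the behavior the first rule asks for.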

Layer 3: Memory, and what to skip

The default is short-context memory only; the message history within the active task. The agent is stateless across sessions unless an ADR has named a specific failure mode that statelessness cannot address.

Episodic memory (cross-session recall) is added per ADR when the agent must remember user-specific state (a support agent recalling account context, a coding agent recalling project conventions). The store is read-mostly and write-gated, never a free-form scratchpad.
Semantic memory (learned facts the agent retrieves) is almost always better implemented as RAG against a controlled corpus than as a memory store the agent writes into. Once the agent writes its own long-term memory, it pollutes the store with hallucinations and the eval suite cannot distinguish legitimate retrieval from a self-generated artifact.

The discipline: default to stateless; add tiers only when statelessness has a named failure mode and a metric that improves under the new tier. “The agent should remember things” is not a requirement; “the agent fails task X 30% of the time because it cannot recall fact Y” is.
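
One way to picture a read-mostly, write-gated episodic store is sketched below. The gate used here (an allowlist of keys plus an allowlist of trusted write sources) is an assumption made for illustration; the actual gate would be whatever the ADR specifies. `EpisodicStore` and the example key and source names are hypothetical.

```python
class EpisodicStore:
    """Per-user store: reads are open, writes must pass a gate."""

    def __init__(self, allowed_keys, trusted_sources):
        self._allowed_keys = frozenset(allowed_keys)
        self._trusted_sources = frozenset(trusted_sources)
        self._data = {}

    def read(self, user_id, key, default=None):
        return self._data.get((user_id, key), default)

    def write(self, user_id, key, value, source):
        """Return True if the write was accepted, False if the gate rejected it."""
        if key not in self._allowed_keys or source not in self._trusted_sources:
            return False
        self._data[(user_id, key)] = value
        return True


store = EpisodicStore(allowed_keys={"plan_tier"}, trusted_sources={"crm_sync"})
accepted = store.write("u1", "plan_tier", "pro", source="crm_sync")       # passes the gate
rejected = store.write("u1", "plan_tier", "enterprise", source="agent")   # agent cannot self-write
current = store.read("u1", "plan_tier")
```

Because the agent is not a trusted source in this sketch, anything it reads back came from a controlled system, which keeps retrieval distinguishable from self-generated content.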

Layer 4: Evals for non-determinism

The hardest problem in agent engineering is that the same input produces a different output run-to-run, and the eval suite has to gate deploys anyway. Three layers, each with its own metric, each running multiple times per case.

Trajectory eval scores the reasoning-and-tool-call sequence against a golden trajectory or rubric. Did the agent take a sensible path? Inspect is the default harness. Leading indicator of a loop or tool-description problem.
Tool-call eval scores each invocation: right tool, well-formed arguments, successful call. Falls fastest when a tool description gets edited carelessly.
End-state eval scores the final output against a rubric or deterministic ground-truth check (the database row exists, the file compiles, the email matches expected fields). Highest-signal metric when a deterministic check exists.

Promptfoo runs prompt-level regression. Langfuse stores every trace and eval result. Non-determinism is handled by running each case three to five times and gating on a pass rate (typically 80–95%), not a single run. Thresholds live in evals/thresholds.yaml gated by the evals-required CI check; a PR that drops trajectory pass-rate or end-state below threshold cannot be merged.
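
The pass-rate gate reduces to a small calculation. The sketch below is independent of Inspect and Promptfoo; `pass_rate`, `gate`, and the threshold values are illustrative assumptions standing in for whatever evals/thresholds.yaml actually contains.

```python
from typing import Callable


def pass_rate(case: Callable[[], bool], runs: int = 5) -> float:
    """Run one eval case several times and return the fraction of passing runs."""
    return sum(1 for _ in range(runs) if case()) / runs


def gate(results: dict, thresholds: dict) -> list:
    """Return the metrics below threshold; an empty list means the gate passes."""
    return sorted(m for m, t in thresholds.items() if results.get(m, 0.0) < t)


# A scripted flaky case: four of five runs pass (stands in for a real agent run).
outcomes = iter([True, True, False, True, True])
trajectory = pass_rate(lambda: next(outcomes), runs=5)

thresholds = {"trajectory": 0.80, "end_state": 0.90}   # illustrative values only
results = {"trajectory": trajectory, "end_state": 0.85}
failing = gate(results, thresholds)
```

Here the trajectory metric sits exactly at its threshold and passes, while end-state falls short, so a CI check built on `gate` would block the merge and name the failing metric.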

Layer 5: Guardrails

Four guardrails are non-negotiable. Anything less is shipping a liability.

Input filtering at the perimeter. Prompt-injection and PII detection on every user message and every tool output that re-enters context. A web-fetch tool’s output is as dangerous as a user message and gets the same treatment.
Output verification on side-effecting tool calls. No agent writes, emails, posts, or moves money without a verifier check that the call matches inferred intent. The verifier is a small, fast model or a deterministic rule, not the same agent that produced the call.
A kill switch. Rate limit per task and hard cap per session, agent forced to halt and surface a summary when either fires. This is what stops a runaway loop from becoming a billing or database incident.
Human-in-the-loop on irreversible actions: external email, deleted records, executed transactions, merged pull requests. Irreversible means human-confirmed by default; deviation is recorded per ADR with explicit risk acknowledgement.
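
Two of the four guardrails, the kill switch and the human-in-the-loop checkpoint, can be sketched together. This example assumes the limits are counted in tool calls (they could equally be tokens or spend), and every name in it (`Budget`, `KillSwitchTripped`, `guarded_call`, `IRREVERSIBLE`, the tool names) is hypothetical.

```python
from dataclasses import dataclass

# Tool names treated as irreversible in this sketch (illustrative list).
IRREVERSIBLE = {"send_external_email", "delete_record", "execute_transaction", "merge_pull_request"}


class KillSwitchTripped(Exception):
    """Raised when a budget is exhausted; the caller halts and surfaces a summary."""


@dataclass
class Budget:
    max_calls_per_task: int
    max_calls_per_session: int
    task_calls: int = 0
    session_calls: int = 0

    def start_task(self) -> None:
        self.task_calls = 0          # the session counter keeps accumulating

    def charge(self) -> None:
        self.task_calls += 1
        self.session_calls += 1
        if self.task_calls > self.max_calls_per_task:
            raise KillSwitchTripped("per-task limit exceeded")
        if self.session_calls > self.max_calls_per_session:
            raise KillSwitchTripped("per-session cap exceeded")


def guarded_call(tool_name, args, tools, budget, confirm):
    """Charge the budget, then require human confirmation before an irreversible tool runs."""
    budget.charge()
    if tool_name in IRREVERSIBLE and not confirm(tool_name, args):
        return {"status": "blocked", "reason": "human confirmation declined"}
    return tools[tool_name](**args)


tools = {
    "delete_record": lambda record_id: {"status": "deleted"},
    "lookup": lambda q: {"status": "ok"},
}
budget = Budget(max_calls_per_task=2, max_calls_per_session=10)
decline = lambda name, args: False   # stand-in for a human who says no

blocked = guarded_call("delete_record", {"record_id": 1}, tools, budget, confirm=decline)
ok = guarded_call("lookup", {"q": "x"}, tools, budget, confirm=decline)
# A third call in the same task would raise KillSwitchTripped.
```

Keeping the thresholds as plain fields on one object is what makes the buyer question "a number and a file path" answerable.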

Layer 6: Trace-first observability

Every agent run produces a single trace capturing the full reasoning chain, every tool call with arguments and results, every model call with tokens and latency, and every guardrail decision. Langfuse is the studio default; OpenTelemetry GenAI semantic conventions are emitted in parallel so traces are portable.

Trace-first means debugging starts from the trace, not the logs. When an agent misbehaves, the engineer pulls the trace in Langfuse, replays it deterministically with the same tool stubs and seed, and steps through the failing turn. The replay capability is what separates an agent practice from an agent demo.
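
Deterministic replay depends on tool stubs that return exactly what the original run saw. A minimal sketch follows, assuming a hypothetical trace shape (a `seed` plus an ordered list of tool calls); this is not Langfuse's export format, and `ReplayTools` and `ReplayDivergence` are illustrative names. The seed would be handed to the model provider where that is supported; it is carried here only as data.

```python
class ReplayDivergence(Exception):
    """Raised when the replayed agent makes a call the recorded run did not."""


class ReplayTools:
    """Serve recorded tool results in order, failing loudly if the agent diverges."""

    def __init__(self, recorded_calls):
        self._calls = list(recorded_calls)
        self._cursor = 0

    def call(self, name, args):
        if self._cursor >= len(self._calls):
            raise ReplayDivergence(f"extra tool call during replay: {name}")
        expected = self._calls[self._cursor]
        if (name, args) != (expected["tool"], expected["args"]):
            raise ReplayDivergence(
                f"diverged at call {self._cursor}: expected {expected['tool']}, got {name}"
            )
        self._cursor += 1
        return expected["result"]        # recorded result, no live side effects

    @property
    def exhausted(self):
        return self._cursor == len(self._calls)


# Hypothetical recorded trace for one failing run.
trace = {
    "seed": 7,
    "tool_calls": [
        {"tool": "get_order", "args": {"order_id": "A1"}, "result": {"status": "shipped"}},
    ],
}

stubs = ReplayTools(trace["tool_calls"])
result = stubs.call("get_order", {"order_id": "A1"})
```

The index in a `ReplayDivergence` message points at the first turn where the replayed agent departs from the recorded run, which is usually the turn worth stepping through.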

Where to deviate from the defaults

Three places consistently justify deviation, each recorded as an ADR.

Regulated or air-gapped engagements force open-weights models, self-hosted MCP servers, and self-hosted Langfuse. Only the deployment surface changes.
Real-time and voice agents replace the text loop with a streaming-first runtime (Anthropic streaming tools, OpenAI Realtime) and rewrite the eval suite to score partial outputs and barge-in.
Genuinely multi-domain engagements replace the single-agent loop with planner-plus-specialists. The bar is high; most engagements that look multi-domain are one agent with too many tools.

Outside these three, deviation is framework preference dressed as architecture.

What a buyer can verify

An agent reference architecture is a thing on disk, not a thing on a slide. Five questions, ten minutes, screen-shared.

“Show me your agent-starter repo and its README.” Loop pattern, tool registry, memory tier, eval suite, guardrails: all visible as configurable parameters in 30 seconds.
“Which loop pattern did you choose on your last agent engagement, and what was the trajectory-eval pass rate on a held-out set?” One sentence with two numbers and a Langfuse link.
“Show me the tool registry.” Every tool with a typed JSON Schema and a description treated as a prompt. A list of Python functions with one-line docstrings is not a registry.
“Pull up the most recent Langfuse trace and step through it.” The engineer points to the failing turn, the model call, and the guardrail decision in under two minutes.
“What is the kill-switch threshold, and where is it enforced?” A number and a file path, not a philosophy.

If a studio cannot pass these in ten minutes, the agent architecture is narrated, not lived. Standardizing this layer is what makes agent engagements ship in week eight instead of week sixteen, and it is the difference between an agent that goes in front of users and one that quietly gets shelved.

Frequently asked questions

What is an agent reference architecture?

The studio’s named defaults across six layers: agent loop, tool registry, memory model, eval suite for non-determinism, guardrail stack, and trace-first observability. It sits one level deeper than a general AI agency reference architecture. The general one names the eval framework and model router; the agent-specific one names the loop pattern, the tool runtime (typed JSON Schema, MCP for cross-process), memory tiering, and the eval suite (trajectory, tool-call, end-state on Inspect).

Which agent loop pattern should be the default in 2026?

ReAct for single-agent engagements with ≤5 tools and predictable horizons. Plan-execute for long-horizon tasks needing an explicit revisable plan: research, multi-step extraction, code generation across files. Hierarchical only for genuine multi-domain engagements where a single prompt cannot hold every tool. Recorded in an ADR with a 50-trajectory benchmark. Defaulting to hierarchical on every engagement buys complexity the eval suite cannot pay for.

How should an AI agency design the tool registry?

Three rules. Every tool has a typed JSON Schema for inputs and outputs, validated on both sides; schema failure is a tool-call error, not silent coercion. In-process tools are typed function calls; cross-process tools live behind MCP servers, reused across agents and providers without reimplementation. The registry caps tool count per agent at twelve to fifteen; beyond that, tool-selection accuracy collapses, and the right pattern is hierarchical decomposition or retrieval over the catalog. Descriptions are treated as prompts and version-controlled.

What memory should an agent have by default?

Short-context only. Episodic memory (cross-session per-user state) is added per ADR when statelessness has a named failure mode. Semantic memory is almost always better implemented as RAG against a controlled corpus than as a memory store the agent writes into. Default to stateless; add tiers only when a metric improves under the new tier.

How do you evaluate an agent when the output is non-deterministic?

Three eval layers. Trajectory eval scores the reasoning-and-tool-call sequence. Tool-call eval scores each invocation: right tool, well-formed arguments, successful call. End-state eval scores the final output against a rubric or deterministic check. Inspect runs trajectory and end-state, Promptfoo runs prompt-level regression, Langfuse stores traces. Non-determinism is handled by running each case three to five times and gating on a pass rate (80–95%), not a single run. Each metric has a threshold in evals/thresholds.yaml gated by evals-required.

What guardrails are non-negotiable in a 2026 agent stack?

Four. Input filtering: prompt-injection and PII detection on every user message and every tool output that re-enters context. Output verification on any tool call with side effects. A kill switch: rate limit per task and hard cap per session, agent forced to halt when either fires. And a human-in-the-loop checkpoint on irreversible actions. Anything less is shipping a liability.

What does trace-first observability mean for agents?

Every run produces a single trace capturing the full reasoning chain, every tool and model call, and every guardrail decision. Langfuse is the studio default; OpenTelemetry GenAI conventions are emitted in parallel so traces are portable. Trace-first means debugging starts from the trace: pull it, replay it deterministically with the same tool stubs and seed, step through the failing turn. A studio debugging agents from print statements has not earned the right to call its work production.

Where should an AI agency deviate from its default agent stack?

Three places, each per ADR. Regulated or air-gapped engagements force open-weights models, self-hosted MCP, and self-hosted Langfuse. Real-time and voice agents replace the text loop with a streaming-first runtime (Anthropic streaming, OpenAI Realtime) and rewrite the eval suite. Genuinely multi-domain engagements replace single-agent with planner-plus-specialists. Outside these three, deviation is framework preference dressed as architecture.

How does this relate to the general AI agency reference architecture?

The general one names model router, eval framework, observability, and deploy platform. The agent-specific architecture sits inside the agent-starter package on that shelf: loop parameterized, tool registry configurable with MCP adapters, memory opt-in via ADR, eval suite pre-wired to Langfuse, guardrails on by default. The companion RAG-heavy reference architecture handles retrieval-heavy engagements; this one handles tool-heavy ones.

How can a buyer verify a vendor has a real agent reference architecture?

Five questions, ten minutes. Ask to see the agent-starter repo with loop, tool registry, memory tier, evals, and guardrails as configurable parameters. Ask which loop they picked on the last engagement and the trajectory-eval pass rate. Ask to see the tool registry: typed JSON Schema, descriptions treated as prompts. Ask for the most recent Langfuse trace and watch the engineer step through it. Ask for the kill-switch threshold: a number and a file path.
