A studio that ships RAG without a named, defended reference architecture is reinventing seven layers of plumbing on most engagements. Chunking, embedding model, vector store, retrieval pattern, reranker, eval suite: each has a defensible default and three credible alternatives, and re-debating them per project is the difference between an engagement that ships in week six and one that ships in week twelve.
This is the RAG-specific spoke of the AI agency manifesto, one level deeper than the studio’s general reference architecture. Where the general one says use evals, this one says which five metrics, on which Ragas configuration, gated by which CI check.
Decision Scope
This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.
Why RAG needs its own reference architecture
The general reference architecture names engagement-wide defaults: model router, eval framework, observability, deploy platform. It does not name the chunking strategy, the embedding model, or the reranker. Those decisions sit inside the studio’s rag-starter package and determine whether the RAG system answers correctly. A studio that treats most RAG engagements as greenfield re-debates them in week one, ships in week four, and discovers in week six that the chunk size is wrong or the reranker was skipped.
The deeper engineering case is in RAG architecture design for an agency, the RAG implementation process guide, and retrieval optimization for RAG systems. The summary:
| Layer | Default | Deviation triggers |
|---|---|---|
| Chunking | Hierarchical with parent-document linking | Transcripts → sliding window. Long-context semantic units → late chunking. |
| Embedding | OpenAI text-embedding-3-large | Finance/legal/code → Voyage. Data residency → mxbai or BGE. |
| Vector store | pgvector on Postgres | >10M vectors or hard tail-latency targets → Pinecone or Turbopuffer. |
| Retrieval | Hybrid (dense + BM25) with reciprocal rank fusion | Pure prose corpora may run dense-only with a reranker. |
| Reranker | Cohere rerank-3 | Voyage corpora → Voyage rerank-2. Air-gapped → bge-reranker-v2-m3. |
| Eval suite | recall@10, recall@3 post-rerank, MRR, faithfulness, answer relevance | Domain-specific metrics added per ADR. |
Layer 1: Chunking strategy
The default is hierarchical chunking with parent-document linking. Documents are split into 200–400-token leaf chunks for embedding, each linked to a 1,500–3,000-token parent passage the LLM reads. The leaf is the unit of similarity; the parent is the unit of context. On internal benchmarks this beats flat fixed-size chunking by five to twelve points of recall@10, plus a generation-quality lift because the LLM gets a coherent passage rather than a 400-token fragment with a dangling pronoun.
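The leaf-to-parent relationship can be sketched in a few lines. This is a minimal illustration, not the rag-starter implementation: it assumes the document is already tokenized into a list, and the names `LeafChunk`, `hierarchical_chunks`, and `parents_for_hits` are hypothetical. The sizes are picked from inside the ranges quoted above.

```python
from dataclasses import dataclass


@dataclass
class LeafChunk:
    leaf_id: str
    parent_id: str
    tokens: list


def hierarchical_chunks(tokens, parent_size=2000, leaf_size=300):
    """Split tokens into parent passages, then split each parent into leaves."""
    parents = {}
    leaves = []
    for p_idx, p_start in enumerate(range(0, len(tokens), parent_size)):
        parent_id = f"p{p_idx}"
        parent_tokens = tokens[p_start:p_start + parent_size]
        parents[parent_id] = parent_tokens
        for l_idx, l_start in enumerate(range(0, len(parent_tokens), leaf_size)):
            leaves.append(LeafChunk(
                leaf_id=f"{parent_id}-l{l_idx}",
                parent_id=parent_id,
                tokens=parent_tokens[l_start:l_start + leaf_size],
            ))
    return parents, leaves


def parents_for_hits(hit_leaf_ids, leaves, parents):
    """Map retrieved leaf ids to de-duplicated parent passages, keeping rank order."""
    by_id = {leaf.leaf_id: leaf for leaf in leaves}
    seen, passages = set(), []
    for leaf_id in hit_leaf_ids:
        parent_id = by_id[leaf_id].parent_id
        if parent_id not in seen:
            seen.add(parent_id)
            passages.append(parents[parent_id])
    return passages
```

Only the leaves would be embedded and searched; `parents_for_hits` is the step that swaps each matching leaf for the passage the LLM actually reads. A production version would split on section or paragraph boundaries rather than fixed token counts.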
Three named alternatives, each chosen via ADR:
- Sliding-window chunking for transcripts, dialogue, and meetings; the signal lives across speaker turns, and a fixed-stride window with 50–100 tokens of overlap captures it cleanly.
- Late chunking: embedding the full document with a long-context embedder (jina-embeddings-v3, voyage-3-large) and chunking the token-level vectors after the fact. It fits when document boundaries carry semantic weight that token-level splits would destroy. Legal contracts, regulatory filings, academic papers.
- Naive 512-token fixed-size chunking is what most “we tried RAG and it didn’t work” engagements shipped. A studio that lists it as default in 2026 has not run the comparison.
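For the sliding-window alternative, a fixed-stride window with overlap can be sketched as follows. The 75-token overlap sits inside the 50–100 range mentioned above; the 300-token window size and the function name `sliding_windows` are assumptions made for illustration.

```python
def sliding_windows(tokens, window=300, overlap=75):
    """Return fixed-size windows where each window repeats the previous one's tail."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    stride = window - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + window])
        # Stop once a window has reached the end of the input.
        if start + window >= len(tokens):
            break
    return chunks
```

Because consecutive windows share their boundary tokens, a question and its answer spoken in adjacent turns are more likely to land in at least one common chunk.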
Layer 2: Embedding model
The default is OpenAI text-embedding-3-large at 3,072 dimensions for general-purpose English RAG, not because it usually wins but because it loses least often.
- Voyage voyage-large-2 (or its 3-class successor) for finance, legal, and code. Internal numbers consistently show a five to ten percent recall@10 lift over the OpenAI default. Also the default whenever the engagement plans to use Voyage’s reranker.
- mxbai mxbai-embed-large-v1 for data-residency or no-vendor-egress constraints. Runs on a single A10 or L4 GPU. BGE (bge-large-en-v1.5, M3) is the close alternative.
- Cohere embed-english-v3 when the engagement is already on Cohere for reranking.
The choice is recorded in an ADR with recall@10 and MRR numbers from a 200-query golden set on the buyer’s domain. A studio picking an embedding model without that comparison is picking it on vibes.
Layer 3: Vector store
The default is pgvector on Postgres for the first ten million vectors. The argument is operational: pgvector keeps embeddings in the same database the application already uses, eliminates two-system consistency bugs, halves the on-call surface, and runs on the existing RDS or Supabase instance. With a tuned IVFFlat or HNSW index and a reranker in front, the recall difference versus a dedicated vector database at sub-ten-million scale disappears under reranker noise.
- Pinecone for production multi-tenant scale with namespace isolation and sub-100ms p99 targets. Pinecone Serverless has shifted the cost story.
- Weaviate or Qdrant when hybrid retrieval and rich filtering need to live in the vector store. Both also work self-hosted when egress is forbidden.
A studio that defaults to Pinecone or Weaviate on day one without a scale, latency, or filtering justification is making the decision before the cost-of-systems argument has been weighed.
Layer 4: Retrieval orchestration
The default is hybrid retrieval: dense vector search plus sparse BM25, fused via reciprocal rank fusion with k = 60. Dense gets the semantic matches; BM25 gets the exact-token matches the embedder cannot understand; RRF merges the two ranked lists without score calibration.
Hybrid is required, not optional, whenever the corpus contains identifiers the embedder cannot reason about: SKUs, regulation numbers (CFR 14.91), drug codes (NDC 0078-0357-15), error codes, internal ticket IDs. It is also the right default for technical documentation, internal knowledge bases, and code search. Pure-prose RAG (policy documents, customer FAQs) can sometimes run dense-only with a strong reranker; the decision belongs in an ADR with a 200-query benchmark on each side.
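Reciprocal rank fusion is small enough to show in full. The sketch below assumes each retriever returns a list of chunk ids ordered best-first; the function name and the alphabetical tie-break are illustrative choices, while k = 60 is the constant named above.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per id."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]   # hypothetical dense-retriever output
sparse_hits = ["c", "a", "d"]  # hypothetical BM25 output
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Only ranks enter the formula, never raw scores, which is why cosine similarities and BM25 scores can be merged without putting them on a common scale.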
Layer 5: Reranker
A reranker is non-negotiable in a 2026 RAG stack. Pulling the top 50 candidates from retrieval, reranking them, and keeping the top three to five for the prompt is what turns “the right chunk is in the top 50” into “the right chunk is in the prompt.”
The default is Cohere rerank-3, which delivers ten to twenty points of nDCG@10 lift over embedding-only retrieval on internal benchmarks, most of the gain on adversarial queries. Voyage rerank-2 on finance and legal corpora when the embedder was Voyage. bge-reranker-v2-m3 for air-gapped engagements. A studio that ships RAG without a reranker is shipping recall and calling it precision.
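The retrieve-50, rerank, keep-five shape can be sketched independently of any vendor. In this illustration `score_fn` is a hypothetical callable standing in for a real reranker (a hosted API or a local cross-encoder), and `token_overlap_score` is a toy scorer included only so the example runs; it is not a substitute for either.

```python
def rerank_top_n(query, candidates, score_fn, pool_size=50, keep=5):
    """Score the first pool_size candidates against the query and keep the best few."""
    pool = candidates[:pool_size]
    scored = [(score_fn(query, text), idx) for idx, text in enumerate(pool)]
    # Highest score first; the original retrieval order breaks ties.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [pool[idx] for _, idx in scored[:keep]]


def token_overlap_score(query, text):
    """Toy stand-in for a reranker: fraction of query tokens present in the text."""
    query_tokens = set(query.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & set(text.lower().split())) / len(query_tokens)
```

Keeping the scorer behind a plain callable is one way to let Cohere, Voyage, or a self-hosted model be swapped per engagement without touching the surrounding pipeline.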
Layer 6: Eval suite
Five metrics, two layers, one threshold file.
Retrieval-layer metrics are computed against a golden set of (query, relevant_chunk_ids) pairs, typically 200 queries built in week one and grown to 500–1,000 by ship.
- recall@10: did the right chunk get retrieved before reranking? Leading indicator of an embedding-or-chunking problem.
- recall@3 (post-rerank): did the right chunk get into the prompt? Leading indicator of a reranker problem if recall@10 is healthy.
- MRR: where in the ranking did the right chunk land?
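The retrieval-layer metrics reduce to a few lines over the golden set. This is a sketch rather than the evalkit harness: the function names are hypothetical, `retrieve` is any callable returning ranked chunk ids for a query, and recall is computed as the fraction of relevant ids found in the top k, which equals a simple hit-or-miss when each query has one relevant chunk.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant chunk ids that appear in the top k retrieved ids."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant chunk id, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(golden_set, retrieve, k=10):
    """Average recall@k and MRR over (query, relevant_chunk_ids) pairs."""
    recalls, ranks = [], []
    for query, relevant_ids in golden_set:
        retrieved = retrieve(query)
        recalls.append(recall_at_k(retrieved, relevant_ids, k))
        ranks.append(reciprocal_rank(retrieved, relevant_ids))
    n = len(golden_set)
    return {
        f"recall@{k}": sum(recalls) / n if n else 0.0,
        "mrr": sum(ranks) / n if n else 0.0,
    }
```

Running the same function with k=10 on pre-rerank results and k=3 on post-rerank results yields the two recall numbers described above.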
Generation-layer metrics are computed via Ragas against the LLM’s actual output.
- faithfulness: did the answer stay grounded in the retrieved context? The single most important quality metric in production RAG.
- answer relevance: did the answer address the question?
Retrieval-layer metrics run on an in-house evalkit harness; generation-layer metrics run on Ragas. Both pipe results into Langfuse. Each metric has a threshold in evals/thresholds.yaml, gated by the studio-standard evals-required GitHub Actions check. A PR that drops recall@10 or faithfulness below threshold cannot be merged. Domain-specific metrics (citation accuracy on regulated content, structured-extraction F1 on financial filings, code-execution success on code-search RAG) are added per engagement via ADR.
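The gating logic itself is simple. The article does not show the schema of evals/thresholds.yaml, so the sketch below assumes a flat mapping of metric name to minimum value, represented as a plain dict; the numbers and the function name `failing_metrics` are hypothetical.

```python
def failing_metrics(results, thresholds):
    """Return the metrics that are missing from results or below their minimum."""
    failures = []
    for name, minimum in sorted(thresholds.items()):
        value = results.get(name)
        # A metric that was never computed fails the gate, same as a low score.
        if value is None or value < minimum:
            failures.append(name)
    return failures


# Hypothetical thresholds and results, for illustration only.
thresholds = {"recall@10": 0.85, "faithfulness": 0.90}
results = {"recall@10": 0.88, "faithfulness": 0.87}
failures = failing_metrics(results, thresholds)
```

A CI step built on this would exit non-zero whenever `failures` is non-empty, which is what blocks the merge. Treating a missing metric as a failure keeps a silently skipped eval from passing the gate.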
Where to deviate from the defaults
Three places consistently justify deviation, each recorded as an ADR:
- Regulated data residency. No-vendor-egress engagements force open-weights embeddings (mxbai or BGE) and self-hosted reranking (bge-reranker-v2-m3). The eval suite does not change; only the model choices do.
- Scale. Above ten million vectors with strict tail-latency targets, pgvector loses to Pinecone or Turbopuffer on operational simplicity, not recall.
- Domain. Finance, legal, biomedical, and code each have an embedding model that beats the general default by a margin worth the migration cost. The deviation is justified with a 200-query benchmark, not a blog post.
Outside these three, deviation is usually engineer preference dressed as architecture.
What a buyer can verify
A RAG reference architecture is a thing on disk, not a thing on a slide. Five questions, ten minutes, screen-shared.
- “Show me your rag-starter repo and its README.” A real studio has it open in 30 seconds, with chunking, embedding model, vector store, and reranker visible as configurable parameters.
- “Which embedding model did you pick on your last RAG engagement, and what was the recall@10 number on a held-out set?” One sentence with two numbers.
- “Pull up the eval threshold file and the most recent Ragas faithfulness number.” If evals/thresholds.yaml does not exist, the eval suite is theatre.
- “What ADR justified the chunking strategy?” A real studio has it in the repo.
- “What does the reranker contribute over embedding-only retrieval?” A number (typically ten to twenty points of nDCG@10) and a Langfuse link.
If the studio cannot pass these in ten minutes, the RAG architecture is being narrated, not lived. Deeper specifics in retrieval optimization for RAG systems and the RAG implementation process guide. Standardizing this layer is what makes RAG engagements ship in week six instead of week twelve.
Frequently asked questions
What is a RAG reference architecture?
The studio’s named defaults across the seven layers of a RAG system: ingestion, chunking, embedding, vector store, retrieval, reranking, and evaluation. It sits one level deeper than a general AI agency reference architecture. The general one names the eval framework and model router; the RAG-specific one names the chunking strategy (hierarchical with parent-document linking), the embedding model (text-embedding-3-large default, voyage-large for finance and legal), the vector store (pgvector for the first ten million vectors, Pinecone for multi-tenant scale), the reranker (Cohere rerank-3), and the eval suite (recall@k, MRR, faithfulness on Ragas). It is what an engagement consumes from the shelf on day one.
Which chunking strategy should be the default?
Hierarchical chunking with parent-document linking. Documents are split into 200–400-token leaf chunks for embedding and retrieval, each linked to a 1,500–3,000-token parent passage that the LLM reads. It beats flat fixed-size chunking on most internal benchmarks. Sliding-window is the alternative for transcripts and dialogue. Late chunking (embedding the full document with a long-context embedder, then chunking the token-level vectors) fits when document boundaries carry semantic weight that token-level splits destroy. Naive 512-token fixed-size chunking is a yellow flag in any architecture review.
What embedding model should an AI agency standardize on in 2026?
OpenAI text-embedding-3-large for general-purpose English RAG. Voyage voyage-large-2 (or its 3-class successor) for finance, legal, and code, where benchmarks consistently show a five to ten percent recall lift. mxbai-embed-large-v1 for engagements with data-residency or no-vendor-egress constraints. Cohere embed-english-v3 when the engagement is already on Cohere for reranking. The choice is recorded in an ADR with recall@10 and MRR numbers from a 200-query golden set on the buyer’s domain.
What vector store should an AI agency default to?
pgvector on Postgres for the first ten million vectors. Pinecone for production multi-tenant scale. Weaviate or Qdrant when hybrid retrieval and rich filtering need to live in the vector store rather than be pre-filtered in Postgres. The default is pgvector because it keeps embeddings in the same database the application already uses, eliminates two-system consistency bugs, and the recall difference at sub-ten-million scale disappears under reranker noise. A buyer who hears Pinecone or Weaviate as a day-one default without a scale or filtering justification should ask why.
Why is a reranker non-negotiable in a 2026 RAG stack?
A reranker turns retrieved-and-noisy into retrieved-and-clean. Bi-encoder retrieval is fast but coarse; cross-encoder reranking is slow but precise. The default is Cohere rerank-3, with Voyage rerank-2 on finance and legal corpora where the embedder is Voyage. The contribution is consistent: ten to twenty points of nDCG@10 lift on internal benchmarks, most of it on adversarial queries an embedding-only stack misses. A RAG architecture without a reranker is shipping recall and calling it precision.
What does hybrid retrieval mean and when is it required?
Dense vector search combined with sparse keyword search (BM25 or SPLADE), fused via reciprocal rank fusion. It is required whenever the corpus contains identifiers the embedder cannot reason about (SKUs, regulation numbers, drug codes, error codes, ticket IDs), because dense retrieval consistently misses on those. Also the right default for technical documentation and code search. For pure prose (policy documents, customer FAQs), dense-only with a strong reranker is often sufficient, but the decision belongs in an ADR with a 200-query benchmark on each side.
What eval metrics should a RAG suite include?
Five, in two layers. Retrieval-layer: recall@10 (did the right chunk get retrieved), recall@3 after reranking (did it land in the prompt), and MRR (where in the ranking it landed). Generation-layer: faithfulness (did the answer stay grounded in retrieved context) and answer relevance (did the answer address the question). Ragas runs the generation-layer metrics; an in-house evalkit harness runs the retrieval-layer metrics against a golden Q-and-relevant-chunk-id set. Each metric has a threshold in evals/thresholds.yaml, gated by the evals-required CI check.
Where should an agency deviate from its default RAG stack?
Three places, each recorded as an ADR. Regulated data residency forces open-weights embeddings (mxbai or BGE) and self-hosted reranking. Scale above ten million vectors with strict tail-latency targets pushes off pgvector toward Pinecone or Turbopuffer. Domain: finance, legal, biomedical, and code each have an embedding model that beats the general default by a margin worth the migration cost. Outside these three, deviation is usually engineer preference dressed as architecture.
How does this relate to the general AI agency reference architecture?
The general one names model router, eval framework, observability, deploy platform, and internal packages. The RAG-specific architecture sits inside the rag-starter package: chunking is parameterized, the embedding model is configurable, the vector store is pgvector by default with a Pinecone adapter, the reranker is Cohere, and the eval suite ships pre-wired to Langfuse. The general one is a precondition for the RAG one. See RAG architecture design for an agency for the deeper engineering case.
How can a buyer verify a vendor has a real RAG reference architecture?
Five questions, ten minutes. Ask to see the rag-starter repo with chunking, embedding model, vector store, and reranker as configurable parameters. Ask which embedding model they picked on the last engagement and the recall@10 on a held-out set. Ask for the eval threshold file and the most recent Ragas faithfulness number. Ask for the ADR that justified the chunking strategy. Ask what the reranker contributes in nDCG@10 over embedding-only retrieval. A real studio answers all five with screen-shared evidence.
Arthur Wandzel