One of the largest concentrations of misallocated AI engineering capacity in 2026 is RAG infrastructure built from scratch. Most quarters we audit teams shipping retrieval-augmented systems and find roughly the same pattern: a senior engineer or two maintaining a custom vector index, a hand-rolled chunking pipeline, a bespoke reranker, an internal LlamaIndex-equivalent that drifted from the original library three releases ago, and a query orchestrator nobody outside the team can operate. The team feels productive (the system ships, retrieval works, evals are green-ish), but the infrastructure consumes more capacity than the moat layer above it. Meanwhile pgvector, Pinecone, LlamaIndex, and Cohere rerank cover roughly 90 percent of what the team built, and the remaining 10 percent is the layer where retrieval quality compounds. This piece names what those four tools do, what they leave on the table, and why reinventing the 90 percent is the most common architectural error in RAG systems today.
This is a spoke under the AI build-vs-buy-vs-hire decision matrix for 2026. The matrix’s third principle is that foundation models are buy permanently. The same logic, applied one layer up the stack, says that RAG infrastructure is buy with a small set of named exceptions. This piece is the case for that extension.
Why this matters now
In 2024 the case against building RAG infrastructure was weaker. Pgvector was stabilizing, Pinecone’s serverless tier was new, LlamaIndex’s orchestration primitives were mid-flight, Cohere rerank was good but not yet the obvious default. A team starting in 2024 had defensible reasons to build at least some of the stack themselves; the buy options had real gaps.
Two years later the gaps have closed. Most layers of the conventional RAG stack have at least one mature buy option, and many have several. The capability gap between buying and self-building has gone from “noticeable” to “buying is strictly better unless you fall into one of three narrow exception cases.”
The teams shipping the best RAG systems in 2026 are not the teams with the most custom infrastructure. They are the teams that bought the infrastructure aggressively and redirected the recovered capacity to chunking strategy, hybrid signal design, reranker fine-tuning, and retrieval-aware evaluation: the moat layer above the infrastructure. Per the depth on retrieval optimization for RAG systems, that moat layer is where retrieval quality compounds quarter over quarter, and the only way to staff it properly is to stop spending capacity on the layer below.
What the buy stack covers: pgvector, Pinecone, LlamaIndex, Cohere rerank
These four tools, composed correctly, ship roughly 90 percent of conventional RAG infrastructure. Each owns a distinct layer.
pgvector owns vector storage when the workload fits inside an existing Postgres cluster. In 2026 that envelope is large; pgvector at production scale handles tens of millions of vectors at sub-100ms latency on hardware most orgs already operate. The integration cost is one Postgres extension and one ALTER TABLE. The ops cost is whatever the org already pays for Postgres.
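As a rough illustration of that integration cost, the sketch below assembles the handful of SQL statements a pgvector setup typically involves. The table name `documents`, the column name `embedding`, the 1536-dimension size, and the choice of an HNSW index with cosine distance are all assumptions for illustration, not details of any particular deployment.

```python
def pgvector_setup_sql(table: str = "documents",
                       column: str = "embedding",
                       dims: int = 1536) -> list[str]:
    """Return SQL for a minimal pgvector setup (all names are hypothetical)."""
    return [
        # Enable the extension once per database.
        "CREATE EXTENSION IF NOT EXISTS vector;",
        # Add a fixed-dimension vector column to an existing table.
        f"ALTER TABLE {table} ADD COLUMN {column} vector({dims});",
        # Optional approximate index; cosine distance is an assumption here.
        f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops);",
    ]


def pgvector_query_sql(table: str = "documents",
                       column: str = "embedding",
                       k: int = 5) -> str:
    """Top-k nearest neighbors by cosine distance (pgvector's <=> operator)."""
    # %s is a driver-side placeholder for the query embedding.
    return f"SELECT id FROM {table} ORDER BY {column} <=> %s::vector LIMIT {k};"


if __name__ == "__main__":
    for statement in pgvector_setup_sql():
        print(statement)
    print(pgvector_query_sql())
```

The statements would be run through whatever Postgres driver and migration tooling the org already uses; nothing here requires a new service.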
Pinecone owns vector storage when the workload outgrows pgvector: high write throughput, billions of vectors, multi-region replication, or the org’s preference for a managed service. Pinecone’s serverless tier in particular has eliminated the operational tax that drove some teams to build custom indexes in 2023. Other vendors in this slot include Weaviate Cloud, Qdrant Cloud, and Turbopuffer; the comparison depth is in the vector database options for RAG analysis.
LlamaIndex owns ingestion, chunking primitives, and orchestration scaffolding. The library ships document loaders for most common sources, chunking strategies (sentence, semantic, recursive, custom), embedding pipelines, retriever abstractions, and query orchestration. None of this is novel work to do by hand; the library has done it well, has been battle-tested across thousands of production deployments, and is open source. The integration cost is hours.
Cohere rerank owns cross-encoder reranking. Reranking is the step between initial retrieval (typically dense or hybrid against the vector index) and the final shortlist passed to the generation model. Cohere rerank-3 ships a cross-encoder that outperforms most custom-trained alternatives across most workloads, with no fine-tuning required for typical use. Voyage rerank and Jina rerank are credible alternatives in the same slot.
Composed: pgvector or Pinecone for storage, LlamaIndex for ingestion and orchestration scaffolding, Cohere rerank for the reranker. Total integration cost: 1 to 3 weeks. Total ongoing maintenance: near zero. Coverage: roughly 90 percent of conventional RAG infrastructure.
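The shape of that composition can be sketched in a few lines. This is a minimal sketch, not any vendor's API: `Retriever` and `Reranker` are hypothetical callables standing in for the bought storage and rerank components, and the toy implementations exist only so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    doc_id: str
    text: str
    score: float


# Hypothetical interfaces; in practice each would wrap a bought component.
Retriever = Callable[[str, int], Sequence[Candidate]]
Reranker = Callable[[str, Sequence[Candidate]], Sequence[Candidate]]


def retrieve_then_rerank(query: str, retriever: Retriever, reranker: Reranker,
                         fetch_k: int = 50, top_n: int = 5) -> list[Candidate]:
    """Fetch a wide candidate set, rerank it, keep the final shortlist."""
    candidates = retriever(query, fetch_k)
    return list(reranker(query, candidates))[:top_n]


# Toy stand-ins so the sketch is runnable without any external service.
CORPUS = {
    "a": "pgvector stores embeddings inside Postgres",
    "b": "Pinecone is a managed vector database",
    "c": "rerankers reorder retrieved passages",
}


def toy_retriever(query: str, k: int) -> list[Candidate]:
    # Word overlap as a crude substitute for vector similarity.
    words = set(query.lower().split())
    scored = [Candidate(i, t, float(len(words & set(t.lower().split()))))
              for i, t in CORPUS.items()]
    return sorted(scored, key=lambda c: c.score, reverse=True)[:k]


def toy_reranker(query: str, candidates: Sequence[Candidate]) -> list[Candidate]:
    # Crude substitute for a cross-encoder: exact phrase match, then prior score.
    return sorted(candidates,
                  key=lambda c: (query.lower() in c.text.lower(), c.score),
                  reverse=True)


if __name__ == "__main__":
    for hit in retrieve_then_rerank("managed vector database",
                                    toy_retriever, toy_reranker, top_n=1):
        print(hit.doc_id, hit.text)
```

Keeping the two stages behind plain callables is one way to leave either component swappable without touching the code that consumes the shortlist.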
What the buy stack does not cover: the 10 percent that is moat
The 10 percent the buy stack does not cover is exactly the layer where retrieval quality compounds. Reinventing the 90 percent does not improve quality. Investing in the 10 percent does. The 10 percent is:
- Chunking strategy tuned to the corpus. LlamaIndex ships chunking primitives, but the strategy choice (how to chunk, at what granularity, with what overlap, with what metadata) is workload-specific. The team that chunks legal contracts the same way it chunks support tickets is leaving quality on the floor.
- Hybrid signal design. Dense embeddings handle semantic similarity well; BM25 handles keyword precision well; metadata filters handle scope. The right hybrid combination is workload-specific, the weighting is calibrated against the org’s relevance feedback, and no library can guess what mix is right for a given workload.
- Reranker fine-tuning against the org’s relevance feedback. Cohere rerank is good out of the box. It is better when fine-tuned against the org’s labeled relevance pairs, especially for domain-specific workloads (legal, medical, financial) where general-purpose rerankers underperform. The fine-tuning is workload-specific moat.
- Query rewriting and decomposition. Complex queries benefit from rewriting (expanding abbreviations, normalizing terminology) and decomposition (splitting multi-part queries into sub-queries that retrieve separately). The rules for both are workload-specific.
- Retrieval-aware evaluation. Measuring retrieval quality against the workload’s actual end-task requires an eval surface tuned to the workload, not a generic RAG benchmark. Per the case for buying the eval stack and building the evaluator, the runtime is buy and the evaluator is build; the principle applies to retrieval evaluation directly.
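To make the last item concrete, here is a minimal sketch of a retrieval evaluator using two standard metrics, recall@k and mean reciprocal rank. The input shape (a list of retrieved IDs paired with a set of labeled relevant IDs per query) is an assumption; a real evaluator would be tuned to the workload's end task, which this sketch does not attempt.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the labeled relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: list[tuple[list[str], set[str]]], k: int = 5) -> dict[str, float]:
    """Average both metrics over labeled queries (hypothetical input shape)."""
    if not runs:
        return {"recall@k": 0.0, "mrr": 0.0}
    n = len(runs)
    return {
        "recall@k": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


if __name__ == "__main__":
    labeled_runs = [
        (["a", "b", "c"], {"b"}),
        (["x", "y", "z"], {"z", "q"}),
    ]
    print(evaluate(labeled_runs, k=2))
```

The labeled pairs are the part that cannot be bought: they come from the org's own relevance feedback.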
These five surfaces are where the senior engineering capacity should be deployed. Reinventing the 90 percent below them is the structural reason most RAG systems plateau on quality.
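Of the five surfaces, hybrid signal design is perhaps the easiest to illustrate. The sketch below combines dense and keyword scores with a weighted sum after min-max normalization; this is one common fusion approach among several, and the default weight of 0.6 is a placeholder that would need calibrating against the org's relevance feedback.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so dense and keyword signals are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(dense: dict[str, float], keyword: dict[str, float],
                  dense_weight: float = 0.6) -> dict[str, float]:
    """Weighted sum of normalized scores; a missing signal counts as 0."""
    d, s = min_max(dense), min_max(keyword)
    return {
        doc_id: dense_weight * d.get(doc_id, 0.0)
        + (1.0 - dense_weight) * s.get(doc_id, 0.0)
        for doc_id in set(d) | set(s)
    }


def top_k(scores: dict[str, float], k: int) -> list[str]:
    """Highest scores first; ties broken by document ID for determinism."""
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:k]


if __name__ == "__main__":
    dense = {"a": 0.9, "b": 0.5, "c": 0.1}       # e.g. cosine similarities
    keyword = {"b": 12.0, "c": 3.0, "d": 8.0}    # e.g. BM25 scores
    print(top_k(hybrid_scores(dense, keyword), k=2))
```

Shifting `dense_weight` changes which documents surface, which is why the weighting is a calibration problem rather than a library default.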
What you cannot differentiate by reinventing the 90 percent
A custom vector index does not retrieve better than Pinecone or pgvector. It retrieves the same; the distance computation is solved math, the indexing structures (HNSW, IVF) are public, and the implementations have been ground out by teams larger than any single org’s. The only thing a custom index might do better is hit a specific cost-performance frontier at extreme scale; outside that, it is a tax.
A custom chunking pipeline does not chunk better than LlamaIndex’s primitives. It chunks differently, and the difference is rarely better, often worse. The library has integrated learnings from thousands of deployments; an internal pipeline cannot match that surface area.
A custom reranker does not outperform Cohere rerank-3 at the base task. It might match it after weeks of training, and it might exceed it on a narrow workload after fine-tuning, but the lift comes from the fine-tuning step (which is moat work and lives above the reranker), not from rebuilding the cross-encoder from scratch.
A custom query orchestrator does not orchestrate better than LlamaIndex’s stable primitives. It orchestrates differently, often with idiosyncrasies that make it harder to onboard new engineers and harder to debug in production.
The general pattern: reinventing the 90 percent does not produce a better system. It produces the same system at higher cost, with worse documentation, and with a smaller maintainer pool. That is engineering self-harm.
The narrow cases where building is justified
Three. Each is narrow.
Case 1: extreme latency at high QPS. Workloads requiring sub-50ms retrieval at thousands of QPS sometimes cannot be served economically by managed providers. Custom co-located indexes can hit the cost-performance frontier here. This case exists at large consumer-AI products and at certain real-time systems; it does not exist at the median enterprise workload.
Case 2: extreme scale with custom partitioning. Workloads with billions of vectors and partitioning logic that depends on workload-specific signals (per-tenant isolation, per-jurisdiction routing, per-classification storage) sometimes require custom indexing because the managed options do not support the partitioning shape natively. This case is rare and shrinking; managed providers are adding partitioning features faster than orgs hit the wall.
Case 3: compliance forcing self-host. Some regulated environments cannot use managed services. Self-hosted pgvector, self-hosted Weaviate, or self-hosted Qdrant cover most of this; the buy decision is preserved, the deployment changes. The narrow remaining case is where even self-hosted vendor software is not approved, which is rare outside specific government workloads.
Outside these three cases, building is the wrong call. Inside them, building is justified, and even then only the specific layer that hits the constraint should be built. The other layers stay bought.
Why teams keep building anyway
Three patterns recur. None of them are good reasons.
The infrastructure problem looks tractable. A senior engineer reads the HNSW paper, builds a working index in two weeks, declares victory. The two weeks become three months of edge-case handling. Six months later the engineer is full-time on the index and the moat layer has not shipped. The same engineer would have shipped four moat-layer features in that time against a bought index.
The team’s identity is custom infrastructure. Per the plumbing-vs-moat analysis, teams that have spent years building plumbing develop an identity around it. Reclassifying the work as commodity feels like reclassifying the team’s contribution. The fix is to redirect the same craftsmanship toward the moat layer above.
Procurement makes buying harder than building. Enterprise procurement cycles for new vendors run 3 to 9 months; engineering can build something in 6 weeks. The local math favors building. The 12-month math does not. The procurement function should be reformed rather than the engineering function being misallocated.
The migration math
For teams already running self-built RAG infrastructure, the migration math is favorable. Per layer:
- Vector index migration: 4 to 8 weeks for typical workloads, longer if the schema or query API drifted significantly from standard interfaces.
- Ingestion pipeline migration: 2 to 6 weeks to LlamaIndex primitives if document loaders and chunking strategies are reasonably modular.
- Reranker swap: 1 to 3 weeks to Cohere rerank, longer if a custom reranker is in active fine-tuning.
- Orchestrator migration: 4 to 8 weeks for non-trivial orchestration, easier if the orchestration logic is well-separated from the infrastructure.
Total migration across all layers: 1 to 6 months end to end depending on starting state. Recovered capacity: 30 to 50 percent of the senior engineering team that was maintaining the self-built stack. Payback period: typically inside the first quarter post-migration, because the recovered capacity goes into moat work that moves quality numbers fast.
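One way to read the 1 to 6 month total is as the span between migrating all layers in parallel (best case) and migrating them one after another (worst case). The sketch below works through that arithmetic from the per-layer ranges listed above; the parallel-versus-sequential framing is an interpretation, not something the estimates themselves state.

```python
# Per-layer migration ranges in weeks, taken from the list above.
LAYER_WEEKS = {
    "vector index": (4, 8),
    "ingestion pipeline": (2, 6),
    "reranker": (1, 3),
    "orchestrator": (4, 8),
}


def sequential_weeks(layers: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Layers migrated one after another: the ranges add up."""
    return (sum(lo for lo, _ in layers.values()),
            sum(hi for _, hi in layers.values()))


def parallel_weeks(layers: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Layers migrated at the same time: the slowest layer sets the pace."""
    return (max(lo for lo, _ in layers.values()),
            max(hi for _, hi in layers.values()))


if __name__ == "__main__":
    print("sequential:", sequential_weeks(LAYER_WEEKS))  # (11, 25) weeks
    print("parallel:  ", parallel_weeks(LAYER_WEEKS))    # (4, 8) weeks
```

Fully parallel at the low end is about 4 weeks, roughly 1 month; fully sequential at the high end is 25 weeks, just under 6 months.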
What to encode
A short list.
- Default verb for RAG infrastructure: buy. Architecture review names the bought components per layer (pgvector or Pinecone for storage, LlamaIndex for orchestration, Cohere for rerank) and lists exception cases explicitly.
- Build exceptions are documented and time-bounded. “We build the vector index because we need 5ms latency at 10K QPS” is documented. The exception is reviewed quarterly to check whether the buy options have closed the gap.
- Moat layer ownership named. The chunking strategy, hybrid signal design, reranker fine-tuning, query rewriting, and retrieval evaluation each have named owners. This is where the senior engineering capacity goes.
- Migration plan named for self-built infrastructure. Teams currently running self-built RAG stacks have a migration plan with timelines, per-layer, reviewed quarterly per the re-litigation principle.
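A documented, time-bounded exception can be as small as one record per built layer. The sketch below is one hypothetical shape for such a record; the field names, the 90-day interval as a stand-in for "quarterly", and the example owner are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class BuildException:
    """One documented reason for building a layer instead of buying it."""
    layer: str
    justification: str
    owner: str
    last_reviewed: date
    review_interval_days: int = 90  # assumed stand-in for a quarterly review

    def next_review(self) -> date:
        return self.last_reviewed + timedelta(days=self.review_interval_days)

    def is_overdue(self, today: date) -> bool:
        """True once the review date has passed without a fresh review."""
        return today > self.next_review()


if __name__ == "__main__":
    exception = BuildException(
        layer="vector index",
        justification="needs 5ms retrieval latency at 10K QPS",
        owner="search-infra",  # hypothetical team name
        last_reviewed=date(2026, 1, 5),
    )
    print(exception.next_review(), exception.is_overdue(date(2026, 4, 6)))
```

An architecture review can then list the overdue records and ask, for each one, whether a buy option now covers the stated constraint.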
Frequently asked questions
What does RAG infrastructure mean in this context?
The substrate underneath a retrieval-augmented generation system: vector index, embedding pipeline, chunking and ingestion logic, basic retriever, reranker, query orchestrator, trace/eval hooks. It does not include the workload-specific retrieval logic on top: chunking strategy choices, hybrid signal weights, reranker fine-tuning. That layer is moat and is the legitimate build target.
Why does pgvector plus Pinecone plus LlamaIndex plus Cohere rerank cover 90 percent?
Because between them they ship most commodity layers at production grade. Pgvector handles storage when the workload fits Postgres. Pinecone handles storage when it outgrows it. LlamaIndex handles ingestion, chunking, orchestration. Cohere rerank handles cross-encoder reranking. The remaining 10 percent is workload-specific retrieval logic.
What can’t be differentiated by reinventing this stack?
Storage, indexing, basic retrieval, basic reranking, ingestion plumbing, and orchestration scaffolding. Reinventing these does not make retrieval better; it makes capacity smaller. Quality compounds above this layer: chunking strategy, hybrid signals, reranker fine-tuning, query rewriting, evaluation.
When is building RAG infrastructure justified?
Three narrow cases. Sub-50ms latency at high QPS where managed providers cannot match cost-performance. Billions of vectors with custom partitioning logic. Compliance environments where no managed option is approved. Outside these, building is engineering self-harm.
Why do teams keep building RAG infrastructure anyway?
The infrastructure problem looks tractable to a senior engineer. The team’s identity is built around custom infrastructure. Procurement makes buying harder than building short-term. None of these is a good reason, and each tends to produce the same outcome: capacity spent on commodity, moat starved.
What about the migration cost from self-built to bought RAG infrastructure?
Roughly 1 to 8 weeks per layer if portability was preserved, longer if not. Recovered capacity compounds for years. Migration usually pays back inside the first quarter because the senior engineer maintaining self-built infrastructure gets redirected to retrieval logic that moves quality numbers.
How does this relate to RAG architectures in regulated industries?
Self-hosted buy options exist for most layers: pgvector inside the org’s Postgres, self-hosted Cohere or local rerankers, open-source LlamaIndex. The compliance constraint changes deployment shape but rarely flips the buy decision; what changes is whether buy is SaaS or self-hosted.
What is the moat layer in RAG that should still be built?
Chunking strategy tuned to the corpus. Hybrid retrieval signal design. Reranker fine-tuning against the org’s relevance feedback. Query rewriting and decomposition. Retrieval-aware evaluation. Each is workload-specific and is where retrieval quality compounds.
Is this advice the same for startups and enterprises?
Yes. The binding constraint is engineering capacity in both cases. Enterprises sometimes face procurement friction; that friction makes the build option look cheaper short-term. The 12-month math says otherwise.
What about teams that already built their own RAG infrastructure two years ago?
Re-evaluate quarterly. The buy options have improved dramatically since 2024: Pinecone serverless, pgvector at production maturity, Cohere rerank-3, LlamaIndex’s stable primitives. A 2024 build decision that was correct then is often wrong now.
Key takeaways
- pgvector, Pinecone, LlamaIndex, and Cohere rerank cover roughly 90 percent of conventional RAG infrastructure at production grade in 2026.
- Reinventing the 90 percent does not produce better retrieval. It produces the same retrieval at higher capacity cost.
- The 10 percent the buy stack does not cover (chunking strategy, hybrid signals, reranker fine-tuning, query rewriting, retrieval evaluation) is exactly where retrieval quality compounds.
- Three narrow cases justify building infrastructure: extreme latency at high QPS, extreme scale with custom partitioning, compliance forcing pure self-host. Outside these, build is the wrong call.
- Migration math favors moving from self-built to bought: 1 to 6 months end to end, payback inside the first quarter post-migration.
The case against building RAG infrastructure in 2026 is not that building is impossible; it is that building is no longer differentiating. The teams shipping the best RAG systems are the teams that bought the substrate and invested the recovered capacity in the layer above. The teams plateauing on quality are the teams still maintaining custom indexes. The fix is one architecture review and one quarter of disciplined migration; the cost of not doing it is the next four quarters of stagnant retrieval quality.
Arthur Wandzel