
Why AI Build-vs-Buy Decisions Made in 2024 Should Be Re-Litigated This Quarter

Most AI build-vs-buy-vs-hire decisions made in 2024 are wrong by 2026, not because the decisions were bad at the time but because the conditions that produced them have shifted faster than the decisions have been reviewed. Foundation model pricing has fallen roughly 70 percent since early 2024. Agent frameworks have matured from prototypes to production substrate. Vector indexing has commoditized. Eval tooling that did not exist in 2024 ships as production product in 2026. Talent supply for specific AI fluencies has rebalanced. The vendor landscape has consolidated. Six conditions have moved meaningfully, and any decision made before they moved is operating against an obsolete map. The cost of running with stale decisions is not visible until a re-litigation forces the comparison; once forced, the typical organization finds 25 to 40 percent of its sourcing verbs are now wrong. This piece names the six conditions that have shifted, the diagnostic test for whether a specific decision is stale, and the playbook for running the re-litigation inside one quarter.

This is a spoke under the AI build-vs-buy-vs-hire decision matrix for 2026. The matrix’s seventh principle is that every decision is re-litigated on a quarterly cadence; this piece is the operationalization for organizations that have not been doing the quarterly re-litigation and are now facing an 18-to-24-month accumulated drift.

Why 2024 decisions are uniquely stale

2024 was the year when most enterprises made their first serious AI sourcing decisions. The ChatGPT-driven wave of 2023 produced experiments; 2024 produced architectures. Procurement signed contracts. Engineering teams committed to build versus buy. Agencies were hired or rejected. The decisions were made under a specific set of conditions: foundation models were expensive enough that running a lot of them was a strategic question, agent frameworks were not yet production-grade, vector databases were a frontier choice, eval tooling was a research category, AI talent was uniformly scarce, and the vendor landscape was fragmented across dozens of small players.

Almost every one of those conditions has flipped or substantially shifted between 2024 and 2026. The shifts are not incremental; they are step-function changes that alter the math on the underlying decision. A 2024 decision against the 2024 conditions can be exactly correct and still be wrong in 2026, because the world the decision was made against no longer exists.

The organizations that have been running quarterly re-litigation per the matrix’s seventh principle have absorbed these shifts as small course corrections. The organizations that have not, which is most organizations, are now sitting on architectures whose verbs have drifted out from under them. The accumulated drift is the topic of this piece.

Condition 1: foundation model unit economics

The largest shift. In Q1 2024, GPT-4 cost roughly $30 per million input tokens and $60 per million output tokens. Anthropic’s Opus was at similar pricing. The economics of running a lot of foundation model calls were a serious budget question; AI products had to be designed to minimize calls, with caching, retrieval, and aggressive prompt compression as first-order architectural concerns.

By Q2 2026, the equivalent capability tier from the leading providers is priced 60 to 80 percent below where it was, with continued downward trajectory. Token cost is no longer the binding constraint on most AI architectures; it has been replaced by latency and quality as the primary design concerns.

The implication for build-vs-buy decisions: capabilities that were “build because we need to minimize calls” in 2024 are often “buy because the calls are now cheap” in 2026. Self-built compression layers, custom caching schemes, and bespoke prompt-shortening infrastructure all become less defensible when the underlying call cost has fallen by 70 percent. The build was correct in 2024; the build is now over-engineering.

What 2024 decisions to revisit: any “we built X to reduce token spend” decision. The math has shifted; the build’s payback has shrunk; the buy alternative is often now competitive.
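To make the payback shift concrete, here is a minimal sketch of the arithmetic. The 2024 prices are the ones quoted above and the 2026 prices apply the 70 percent drop; the workload size, the compression ratio, and the build cost are hypothetical numbers chosen only for illustration, and the sketch assumes the compression layer reduces input tokens only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrices:
    """Prices in USD per million tokens."""
    input_per_million: float
    output_per_million: float


# Q1 2024 prices as quoted in the text; 2026 applies a 70 percent drop.
PRICES_2024 = TokenPrices(input_per_million=30.0, output_per_million=60.0)
PRICES_2026 = TokenPrices(input_per_million=9.0, output_per_million=18.0)

# Hypothetical workload and build, for illustration only.
MONTHLY_INPUT_TOKENS = 2_000_000_000  # 2B input tokens per month
COMPRESSION_RATIO = 0.40              # share of input tokens the layer removes
BUILD_COST_USD = 150_000              # one-time cost of building the layer


def monthly_input_spend(prices: TokenPrices, input_tokens: int) -> float:
    """Monthly spend on input tokens at the given prices."""
    return input_tokens / 1_000_000 * prices.input_per_million


def compression_payback(prices: TokenPrices) -> float:
    """Months until the compression layer's build cost is recovered."""
    monthly_savings = COMPRESSION_RATIO * monthly_input_spend(
        prices, MONTHLY_INPUT_TOKENS
    )
    return BUILD_COST_USD / monthly_savings


if __name__ == "__main__":
    for label, prices in (("2024", PRICES_2024), ("2026", PRICES_2026)):
        months = compression_payback(prices)
        print(f"{label}: payback in about {months:.1f} months")
```

Under these assumed numbers the same build pays back in roughly 6 months at 2024 prices and roughly 21 months at 2026 prices. Nothing about the build changed; only the price it was amortized against did.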

Condition 2: agent framework maturity

The second-largest shift. In 2024 the agent framework landscape was a research zoo. LangChain, LlamaIndex, AutoGPT, BabyAGI, and a dozen other projects all claimed to handle multi-step agent workflows; none of them was production-grade. Teams that needed multi-step agents in 2024 had to build their own loop, error handling, tool routing, and state management because the frameworks could not be relied on.

By 2026, the OpenAI Agents SDK, the Anthropic agent harness, LangGraph (now production-grade), Pydantic AI, and AutoGen have all matured to the point where the agent loop layer is a credible buy. The orchestration logic on top is still build (per the matrix’s fourth principle), but the underlying loop is no longer something a serious team builds from scratch.

The implication: 2024-vintage custom agent loops are now plumbing, not moat (per the AI plumbing-vs-moat piece). The build was correct in 2024; the build is now consuming engineering capacity that should be in moat work.

What 2024 decisions to revisit: any custom agent loop, custom tool-call dispatcher, or custom state machine for agent execution. Migration to a 2026 framework typically takes 4 to 8 weeks and recovers the engineering capacity for moat work.

Condition 3: vector indexing commoditization

The third shift. In 2024 vector indexing was a frontier engineering category. Pinecone, Weaviate, and Qdrant were credible but young; Faiss self-managed was the dominant fallback because the managed offerings did not yet have the scale and reliability senior engineers wanted. Teams routinely built their own indexing infrastructure because the buy options had visible weaknesses.

By 2026 vector indexing is a commodity. Pinecone, Weaviate Cloud, Qdrant Cloud, Turbopuffer, and pgvector all ship production-grade with horizontal scale, replication, and predictable pricing. The performance gap between self-managed Faiss and managed alternatives has compressed to “engineering preference” rather than “engineering necessity.”

The implication: self-managed vector indices that exist because of 2024-era performance concerns are often migration candidates in 2026.

What 2024 decisions to revisit: any self-managed vector index where the original justification was “managed options aren’t ready.” The original justification has expired; the migration is now usually correct. The detail on this is in the vector database options for RAG analysis.

Condition 4: eval tooling now exists

The fourth shift, and the most under-appreciated. In 2024 eval tooling was a research category. Promptfoo, Inspect, OpenAI Evals, and Braintrust either did not exist or were too immature to build production eval suites against. Teams that wanted real eval discipline had to build the eval harness themselves: the test runner, the scoring rubric infrastructure, the regression detection, and the threshold-locking system.

By 2026 the harness layer ships as multiple production products. The eval suites running on those harnesses are still build or hire (per the matrix’s fifth principle), but the harness underneath is no longer a build.

The implication: 2024-vintage custom eval harnesses are now the wrong layer to be building. The harness is bought; the suite is built.

What 2024 decisions to revisit: custom eval test runners, custom scoring infrastructure, custom regression-detection pipelines. Most of this layer migrates cleanly to a 2026 eval framework, recovering the engineering capacity to build deeper eval suites against the workload.

Condition 5: talent supply rebalanced

The fifth shift. In 2024 the AI talent market was uniformly hot; every AI-fluent senior engineer was being courted, salaries were inflating quarterly, and hire-or-build decisions tilted aggressively toward “build because we cannot hire.” Many 2024 decisions to outsource AI work to agencies were made because the org could not staff the work in any reasonable timeline.

By 2026 the market has rebalanced: not cooled, but redistributed. Foundation model engineers are still scarce. Agent orchestration engineers are scarce. Eval engineers are scarce. But generalist senior engineers with one to two years of AI exposure are now reasonably common, and many can be promoted into AI-fluent roles inside the org with focused training. The hire decision in 2026 is less binary than it was in 2024.

The implication: 2024-vintage decisions to outsource because “we cannot hire” should be re-tested against the current talent market. In many cases the org now has internal candidates who can be promoted into the role faster and cheaper than the outsource arrangement was costing.

What 2024 decisions to revisit: long-term agency engagements that were originally hire-substitutes rather than expertise-buys. The full breakdown of when to bring this work back in-house is in the build AI in-house vs outsource analysis and the technical cofounder vs AI agency comparison.

Condition 6: vendor consolidation across the stack

The sixth shift. In 2024 the AI tooling vendor landscape had hundreds of small players, many of them adjacent to each other in capability. Choosing among them was hard; many teams bought multiple overlapping products because no single vendor covered the workflow. Vendor risk was a genuine concern because half the small players were unlikely to survive 18 months.

By 2026 the landscape has consolidated. The leading observability vendors (Langfuse, Helicone, Arize, Braintrust, LangSmith) cover the workflow with overlapping capability. The leading prompt management vendors are similarly consolidated. The leading agent frameworks have settled into a small set with similar-but-different design philosophies. Vendor risk is now lower; vendor selection is now a more meaningful choice because the choices have differentiated.

The implication: 2024-vintage decisions that were “buy multiple overlapping products to hedge” are now consolidation candidates. A single vendor in each layer can cover what previously required two or three.

What 2024 decisions to revisit: multi-vendor stacks in observability, prompt management, and agent tooling. Consolidating onto a single vendor per layer typically reduces total spend by 30 to 50 percent and reduces operational complexity meaningfully.

Diagnostic: is a specific 2024 decision stale?

Apply three tests, in order, to a specific 2024-vintage decision.

Test 1: did any of the six conditions move? Look at the original justification for the decision. Did it depend on 2024-era token economics, 2024-era agent framework immaturity, 2024-era vector indexing immaturity, 2024-era eval tooling absence, 2024-era talent scarcity, or 2024-era vendor fragmentation? If yes, the decision is a candidate for re-litigation.

Test 2: would the four questions score the capability differently today? Run the four questions framework against the capability with current 2026 conditions. If the resulting verb is different from the 2024 verb, the decision is stale.

Test 3: what would the decision cost to reverse? A stale decision that is expensive to reverse is still worth reversing if the ongoing cost of running with it is high enough. A stale decision that is cheap to reverse should be reversed by default. The reversal cost determines the priority, not the necessity.

A decision that fails any of the three tests is on the re-litigation list. A decision that fails two or three of them is at the top of the list.
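The three tests can be sketched as a small scoring function. This is an illustration under stated assumptions, not a prescribed implementation: the `Decision` fields, the cost units (engineer-weeks), and the `PAYBACK_HORIZON_QUARTERS` threshold are all hypothetical. In particular, the text treats reversal cost as a priority signal rather than a pass/fail test, so the sketch assumes Test 3 counts as failed when the reversal pays for itself within the horizon.

```python
from dataclasses import dataclass, field
from enum import Enum


class Condition(Enum):
    """The six conditions that shifted between 2024 and 2026."""
    TOKEN_ECONOMICS = "token economics"
    AGENT_FRAMEWORKS = "agent framework maturity"
    VECTOR_INDEXING = "vector indexing maturity"
    EVAL_TOOLING = "eval tooling"
    TALENT_SUPPLY = "talent scarcity"
    VENDOR_FRAGMENTATION = "vendor fragmentation"


@dataclass
class Decision:
    capability: str
    verb_2024: str    # "build", "buy", or "hire"
    verb_today: str   # verb the four questions give under current conditions
    # Conditions the original 2024 justification depended on.
    justified_by: set = field(default_factory=set)
    reversal_cost_weeks: float = 0.0             # one-time, engineer-weeks
    ongoing_cost_weeks_per_quarter: float = 0.0  # cost of keeping the old verb


# Assumption: a reversal is "cheap enough" if it pays back within a year.
PAYBACK_HORIZON_QUARTERS = 4


def failed_tests(d: Decision) -> list[int]:
    """Return the numbers of the tests this decision fails."""
    failed = []
    if d.justified_by:               # Test 1: justification relied on a moved condition
        failed.append(1)
    if d.verb_today != d.verb_2024:  # Test 2: the four questions now give a new verb
        failed.append(2)
    horizon_cost = d.ongoing_cost_weeks_per_quarter * PAYBACK_HORIZON_QUARTERS
    if d.ongoing_cost_weeks_per_quarter > 0 and d.reversal_cost_weeks <= horizon_cost:
        failed.append(3)             # Test 3 (assumed reading): reversal pays back in time
    return failed


def list_position(d: Decision) -> str:
    """Map the number of failed tests to a place on the re-litigation list."""
    count = len(failed_tests(d))
    if count >= 2:
        return "top of list"
    if count == 1:
        return "on list"
    return "not listed"


# Hypothetical examples.
agent_loop = Decision("custom agent loop", "build", "buy",
                      {Condition.AGENT_FRAMEWORKS}, 6.0, 3.0)
ranking_model = Decision("proprietary ranking model", "build", "build")

if __name__ == "__main__":
    for d in (agent_loop, ranking_model):
        print(d.capability, failed_tests(d), list_position(d))
```

With these example inputs the custom agent loop fails all three tests and lands at the top of the list, while the ranking model fails none and stays off it.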

The one-quarter re-litigation playbook

For organizations that have not been running quarterly re-litigation, catching up takes one focused quarter. The playbook has four phases.

Phase 1, weeks 1-2: inventory. List every AI capability the org runs, the verb attached to each (build/buy/hire), and the date the verb was decided. The list is typically 30 to 50 capabilities; producing it is a 2-day exercise for the architecture group. The output is the capability ledger that the matrix’s first principle requires.
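A capability ledger does not need special tooling; a spreadsheet works. For teams that prefer to keep it in code, one possible shape is sketched below. The `LedgerEntry` fields mirror the three items named above (capability, verb, decision date); the class name, the helper, and the example entries are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

VALID_VERBS = {"build", "buy", "hire"}


@dataclass(frozen=True)
class LedgerEntry:
    capability: str
    verb: str          # one of VALID_VERBS
    decided_on: date   # when the verb was last decided or confirmed

    def __post_init__(self) -> None:
        if self.verb not in VALID_VERBS:
            raise ValueError(f"unknown verb: {self.verb!r}")


def needs_triage(ledger: list, cutoff_year: int = 2024) -> list:
    """Entries whose verb was decided in the cutoff year or earlier."""
    return [entry for entry in ledger if entry.decided_on.year <= cutoff_year]


# Hypothetical ledger rows, for illustration only.
LEDGER = [
    LedgerEntry("agent loop", "build", date(2024, 3, 11)),
    LedgerEntry("vector index", "build", date(2023, 11, 2)),
    LedgerEntry("observability", "buy", date(2025, 9, 30)),
]

if __name__ == "__main__":
    for entry in needs_triage(LEDGER):
        print(f"{entry.capability}: {entry.verb} (decided {entry.decided_on})")
```

Recording the decision date is the part that tends to get skipped, and it is the field the triage phase filters on.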

Phase 2, weeks 3-5: triage. Run the three-test diagnostic against every capability whose verb was decided in 2024 or earlier. Mark each as stale (verb is wrong), drift (verb is right but supporting structure has shifted), or stable (verb is right and supporting structure is stable). Typical distribution: 25 to 40 percent stale, 30 to 40 percent drift, 25 to 40 percent stable.
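The three triage labels follow directly from two inputs: whether the verb changed, and whether the supporting structure shifted. A minimal sketch, with hypothetical function names and sample rows, that also tallies the distribution so it can be compared against the typical ranges above:

```python
from collections import Counter
from enum import Enum


class Status(Enum):
    STALE = "stale"    # verb is wrong
    DRIFT = "drift"    # verb is right, supporting structure has shifted
    STABLE = "stable"  # verb is right, supporting structure is stable


def triage(verb_decided: str, verb_today: str, structure_shifted: bool) -> Status:
    """Classify one capability from the diagnostic's outputs."""
    if verb_decided != verb_today:
        return Status.STALE
    if structure_shifted:
        return Status.DRIFT
    return Status.STABLE


def distribution(statuses: list) -> dict:
    """Share of capabilities in each status (all zeros for an empty list)."""
    counts = Counter(statuses)
    total = len(statuses)
    return {s: (counts[s] / total if total else 0.0) for s in Status}


# Hypothetical rows: (verb decided in 2024, verb today, structure shifted?)
ROWS = [
    ("build", "buy", True),
    ("build", "build", True),
    ("buy", "buy", False),
    ("hire", "build", False),
]
STATUSES = [triage(*row) for row in ROWS]

if __name__ == "__main__":
    for status, share in distribution(STATUSES).items():
        print(f"{status.value}: {share:.0%}")
```

Checking a changed verb first means a capability whose verb changed is marked stale even if its supporting structure also shifted, which matches the definitions above.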

Phase 3, weeks 6-9: prioritize and scope. For stale decisions, score the cost of reversal and the ongoing cost of running with the wrong verb. Stale decisions with high ongoing cost and moderate reversal cost are top priority. Stale decisions with low ongoing cost or high reversal cost can wait. Typical output: 5 to 12 decisions get scoped for action this quarter; the rest go on a roadmap for the next two quarters.
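One way to turn the prioritization rule into an ordering is a simple ratio of ongoing cost to reversal cost. The ratio, the field names, and the example numbers below are assumptions made for illustration; any scoring that ranks high-ongoing-cost, moderate-reversal-cost decisions first would serve the same purpose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaleDecision:
    capability: str
    ongoing_cost: float   # engineer-weeks per quarter spent on the wrong verb
    reversal_cost: float  # one-time engineer-weeks to reverse the decision

    def __post_init__(self) -> None:
        if self.reversal_cost <= 0:
            raise ValueError("reversal_cost must be positive")


def priority(d: StaleDecision) -> float:
    """Higher is more urgent: ongoing cost recovered per week of reversal work."""
    return d.ongoing_cost / d.reversal_cost


def scope_quarter(decisions: list, max_scoped: int = 12) -> tuple:
    """Split stale decisions into (this quarter, roadmap) by priority."""
    ranked = sorted(decisions, key=priority, reverse=True)
    return ranked[:max_scoped], ranked[max_scoped:]


# Hypothetical stale decisions, for illustration only.
CANDIDATES = [
    StaleDecision("custom eval runner", ongoing_cost=1.0, reversal_cost=8.0),
    StaleDecision("custom agent loop", ongoing_cost=6.0, reversal_cost=6.0),
    StaleDecision("self-managed vector index", ongoing_cost=5.0, reversal_cost=20.0),
]

if __name__ == "__main__":
    this_quarter, roadmap = scope_quarter(CANDIDATES, max_scoped=2)
    print("this quarter:", [d.capability for d in this_quarter])
    print("roadmap:", [d.capability for d in roadmap])
```

The default cap of 12 reflects the upper end of the typical 5 to 12 decisions scoped per quarter; an organization with less capacity would lower it.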

Phase 4, weeks 10-13: execute and standardize. Execute the highest-priority reversals. Establish the quarterly review cadence so this catch-up exercise becomes a 30-minute review next quarter rather than a 13-week catch-up. Update the architecture documentation to reflect the new verbs.

The cost of the playbook is roughly 15 to 25 percent of the architecture group’s capacity for one quarter, plus the engineering work to execute the reversals (highly variable, but typically 4 to 12 weeks per reversal). The benefit is the elimination of the accumulated drift and the establishment of a discipline that prevents re-accumulation.

Frequently asked questions

How do I know if my organization has been running quarterly re-litigation?

If you have a written capability ledger updated each quarter and a calendar item for the architecture group to review verbs, you are running it. If you cannot point to either, you are not. Most organizations are not, even when senior leadership believes the practice is in place; the cadence requires explicit calendar enforcement, not just an intention.

Is 2024 special, or will 2026 decisions be similarly stale by 2028?

2024 is special because the AI tooling landscape underwent more change between 2024 and 2026 than in any prior two-year window. 2026 to 2028 is unlikely to see an equivalent magnitude of shift in the same categories; vector indexing, agent frameworks, and eval tooling have already commoditized. But other categories (multi-modal models, on-device inference, custom hardware) are at the 2024-equivalent stage of maturity in 2026 and will produce equivalent staleness in 2028. The discipline of quarterly re-litigation is the same.

What if my 2024 decisions still feel right after the diagnostic?

Some will. The expected hit rate for stale decisions is 25 to 40 percent of 2024-vintage decisions, not 100 percent. Decisions that pass the three-test diagnostic are confirmed as stable, which is itself a useful output of the exercise; the architecture group can stop second-guessing them.

How do I handle stakeholders who resist re-litigation?

Frame it as architectural hygiene rather than as second-guessing. The goal is not to re-litigate every prior decision; the goal is to identify the specific decisions whose underlying conditions have shifted enough to merit a fresh look. The diagnostic is fast, the typical hit rate is moderate, and the alternative is running with an accumulating drift that surfaces as architectural pain 6 to 18 months later. Most stakeholders accept the framing once it is presented as hygiene.

Should I re-litigate decisions made by a previous architect or team?

Yes, with the same diagnostic. The decision’s provenance does not affect its current correctness. A decision made by a previous team can be stable, drifted, or stale on the same axes as a decision made by the current team. The political cost of reviewing previous-team decisions is real but bounded; the cost of running with stale decisions is unbounded.

How does the re-litigation interact with vendor contracts?

Stale decisions that involve multi-year vendor contracts can sometimes be reversed inside the contract through workload reduction (move some workload to the newly chosen vendor while running down the existing contract). When a decision genuinely cannot be reversed inside the contract term, the decision is “buy the next contract differently”; the re-litigation produces the verb for renewal, not for immediate switch.

What if the re-litigation reveals that a 2024 buy decision should be a 2026 build decision?

This is less common than the reverse but does happen, particularly in the orchestration layer where buy options that looked acceptable in 2024 now visibly cap the ceiling. The right response is to scope the build, get talent in place to operate it, and migrate the workload deliberately. The migration is typically 6 to 12 months for non-trivial cases; it is worth doing because the buy-to-build migration’s payback is years, not quarters.

How does this exercise relate to the AI capability ladder?

The capability ladder gives the default verb for common capabilities under 2026 conditions. The re-litigation exercise compares 2024-decided verbs to those defaults; mismatches are the candidates for re-litigation. The ladder is the reference; the re-litigation is the workflow that aligns the org’s actual decisions to the reference. The full ladder is at the AI capability ladder piece.

What’s the worst-case if we skip the re-litigation?

The drift compounds. Stale decisions consume engineering capacity and operational budget at rates 20 to 40 percent above optimal. The accumulated drift becomes visible 12 to 18 months later as “we cannot ship X because the platform was built around assumptions that no longer hold.” At that point the re-litigation is forced, but the cost of catching up is higher because more decisions have drifted further. The math favors running the re-litigation now rather than later.

Is there a lighter version of this exercise for resource-constrained orgs?

Yes. Run only the three-test diagnostic against the top 10 highest-impact AI capabilities: the ones where build/buy/hire choices touch the most engineering capacity or the most contract spend. This is a one-week exercise rather than a one-quarter exercise. It will not catch every stale decision, but it will catch the most expensive ones, which is most of the value.

Key takeaways

  • Six conditions have shifted between 2024 and 2026: foundation model economics, agent framework maturity, vector indexing commoditization, eval tooling maturity, talent supply rebalance, and vendor consolidation.
  • 25 to 40 percent of 2024-vintage AI sourcing decisions are now stale; the typical organization is running with 18 to 24 months of accumulated decision drift.
  • The diagnostic is three tests: did any condition shift, would the four questions score the capability differently today, and what is the reversal cost.
  • The one-quarter re-litigation playbook is inventory, triage, prioritize, execute; typically 5 to 12 reversals per quarter for an organization catching up.
  • Establishing the quarterly review cadence after catch-up turns the next review into a 30-minute exercise rather than a 13-week project.

The cost of running the re-litigation is one quarter of architecture-group attention plus the engineering execution on the highest-priority reversals. The cost of skipping it is the compounding drift of every quarter that the wrong verbs persist. Organizations that internalize this end up with AI architectures that compound; organizations that do not end up with AI architectures whose every quarter is a small disappointment that nobody quite explains. The compounding goes one direction or the other; standing still is not an option in a category whose underlying tooling moves this fast.

Last Updated: Jun 14, 2026

Arthur Wandzel

SFAI Labs helps companies build AI-powered products that work. We focus on practical solutions, not hype.