Enterprise Software 17 min read

The 4 Questions That Decide Whether an AI Capability Should Be Built, Bought, or Hired

Almost every AI build-vs-buy-vs-hire decision in 2026 resolves through four questions, asked in order, scored against the specific capability under review. The questions are: how much does competitive position depend on this capability being uniquely good (moat density); how many other capabilities does this one touch (integration depth); how often does the right answer change (decision velocity); and does the org have, or can it acquire, the talent to operate this capability inside the timeline the work demands (talent fit). Most decisions become obvious once those four are scored. The decisions that remain unobvious after scoring are the ones worth a meeting; everything else can be resolved by the matrix in fifteen minutes. This piece names the four questions, shows how each scores, and gives a worked checklist that turns the four scores into a verb.

This is a spoke under the AI build-vs-buy-vs-hire decision matrix for 2026. The matrix’s second principle names the three axes that drive the decision. This piece is the operationalization: how to score a capability against those axes, plus the talent dimension that the matrix’s sixth principle treats as a first-class input.

Why four questions instead of a longer framework

Decision frameworks tend to expand. A team starts with build-vs-buy as a binary, adds hire as a third option, then adds cost considerations, then risk, then integration complexity, then strategic alignment, then ten more axes. Six months later the framework has 20 questions and nobody uses it: 20 questions per capability across 40 capabilities is 800 answers, a quarter of work nobody has.

The version that works is shorter. Four questions, asked in fixed order, with a five-row scoring rubric per question. Total cost per capability: 10 to 15 minutes. Total cost for a 40-capability ledger: roughly one day of the architecture group’s time, executed once and re-executed quarterly per the matrix’s seventh principle.
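The time budget above is plain arithmetic, sketched below. The function name and its defaults are illustrative; the figures (10 to 15 minutes, 40 capabilities, 20 questions) come from the text.

```python
def review_minutes(capabilities: int, low: int = 10, high: int = 15) -> tuple[int, int]:
    """Return the (low, high) total minutes to score a ledger once."""
    return capabilities * low, capabilities * high


# 40 capabilities at 10 to 15 minutes each.
low_total, high_total = review_minutes(40)
print(f"{low_total / 60:.1f} to {high_total / 60:.1f} hours")  # 6.7 to 10.0 hours

# The 20-question framework from the previous paragraph, for comparison.
print(20 * 40, "answers needed")  # 800 answers needed
```

At 400 to 600 minutes, the four-question pass fits inside roughly one working day.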

The four questions are not the only ones that matter for sourcing. They are the four whose answers determine the verb in the overwhelming majority of cases. Other considerations (vendor stability, contract terms, regulatory constraints, internal politics) are real but secondary. They modulate the verb after the four questions name it; they do not select it.

Question 1: How much does competitive position depend on this being uniquely good?

The first and most important question. It asks whether the capability is a place where the organization expects to differentiate against competitors, or whether it is a place where the organization needs to be roughly as good as everyone else but not better.

Score on a five-row rubric:

  • 5: The capability is the primary axis of competitive differentiation. If a competitor matched it, our entire product position changes.
  • 4: The capability is a secondary axis of differentiation. Strong execution here is a meaningful advantage but not the whole position.
  • 3: The capability needs to be roughly category-best but not uniquely strong. Parity with the leading vendor’s offering is sufficient.
  • 2: The capability needs to be functionally adequate. Below adequate produces customer complaints; above adequate produces no incremental win.
  • 1: The capability is a commodity. Differentiation here is invisible to customers and to the income statement.

Capabilities scoring 5 or 4 are build candidates. Capabilities scoring 1 or 2 are buy candidates. Capabilities scoring 3 are the contested middle, where the next three questions do most of the work.

The pattern that misleads teams: scoring everything as 4 or 5 because everything feels strategic when you build it. The discipline is to ask “if a competitor matched this exactly tomorrow, what would change in our customer wins?” If the honest answer is “not much,” the score is 2 or 3, not 4.

Question 2: How many other capabilities does this one touch?

The second question. It asks how deeply integrated the capability is into the broader system: not how complex the capability is internally, but how many other capabilities depend on it or feed into it.

Score on a five-row rubric:

  • 5: The capability touches every other capability in the system. Foundation models are the canonical example; the model is called by every agent, every retrieval pipeline, every evaluation, every observability hook.
  • 4: The capability touches most capabilities. Agent orchestration touches retrieval, eval, observability, and most product surfaces.
  • 3: The capability touches several capabilities at boundaries. The eval suite touches the agent layer, the retrieval layer, and the observability layer at well-defined boundaries.
  • 2: The capability touches a handful of specific capabilities at one or two boundaries. A reranker touches retrieval and possibly eval; nothing else.
  • 1: The capability is peripheral. Red-team tooling touches the eval pipeline at one boundary and is otherwise standalone.

Capabilities scoring 5 or 4 (high integration depth) resist buy because the buy contract has to penetrate to the heart of the architecture, which creates lock-in risk and integration drag. Capabilities scoring 1 or 2 tolerate buy because the contract terminates at one or two boundaries.

The pattern that misleads teams: confusing internal complexity with integration depth. A reranker can be internally complex (model architecture, training data, fine-tuning pipeline) and have low integration depth (it talks to retrieval and eval, period). A reranker that is internally simple but integration-deep is rare; the question is about the integration footprint, not the engineering effort.
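One way to keep the two ideas apart is to count the integration footprint directly from a list of which capabilities touch which. The sketch below is illustrative only: the edge list is a partial, hypothetical one assembled from the rubric’s own examples, and the source gives no formula for turning a count into a 1-to-5 score, so the code stops at the count.

```python
# Hypothetical, partial edge list: (capability, capability it touches).
# A real list would come from the architecture diagram.
TOUCHES = [
    ("reranker", "retrieval"),
    ("reranker", "eval"),
    ("red-team tooling", "eval"),
    ("agent orchestration", "retrieval"),
    ("agent orchestration", "eval"),
    ("agent orchestration", "observability"),
]


def footprint(capability: str, edges: list[tuple[str, str]]) -> set[str]:
    """Return the other capabilities this one touches, in either direction."""
    touched = set()
    for a, b in edges:
        if a == capability:
            touched.add(b)
        elif b == capability:
            touched.add(a)
    return touched


for name in ("reranker", "red-team tooling", "agent orchestration"):
    print(name, sorted(footprint(name, TOUCHES)))
```

Nothing in the count depends on how hard the capability is to engineer, which is the point: the reranker’s footprint stays at two however complex its training pipeline is.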

Question 3: How often does the right answer change?

The third question. It asks about the decision velocity of the underlying technology: how often the optimal answer for this capability changes such that the prior answer becomes wrong.

Score on a five-row rubric:

  • 5: Optimal answer changes quarterly or faster. Foundation models are the canonical example; the right model on January 1 is often not the right model on April 1.
  • 4: Optimal answer changes every 6 to 12 months. Agent frameworks, prompt patterns, and reranker architectures change at this pace.
  • 3: Optimal answer changes annually. Eval-set design, observability schema, and retrieval architectures are typically annual.
  • 2: Optimal answer changes every 2 to 3 years. Vector index providers, deployment infrastructure, and data labeling vendors typically last this long.
  • 1: Optimal answer changes every 5+ years. Underlying database architecture, network topology, and identity providers are this slow.

Capabilities scoring 5 or 4 (high decision velocity) resist build because the build calcifies a quarterly answer into multi-year code. By the time the build ships, the optimal answer has moved. Capabilities scoring 1 or 2 tolerate build because the answer holds long enough for the build to amortize.

The pattern that misleads teams: assuming AI capabilities all have decision velocity 5. The foundation layer is at 5, but the vector index layer is at 2 or 3. Not every AI capability changes quarterly; the discipline is to score per capability rather than apply the average velocity across the stack.

Question 4: Does the org have or can it acquire the talent on the timeline?

The fourth question, and the one most often skipped because it feels like a separate concern. It is not separate. The talent fit determines whether build is feasible at all, and whether hire is permanent or rented.

Score on a five-row rubric:

  • 5: The org has the talent on staff, with depth (multiple senior engineers fluent in this capability), and the talent is retained for at least the project timeline.
  • 4: The org has the talent on staff, but with a single point of failure (one senior engineer) or with retention risk inside the project timeline.
  • 3: The org does not have the talent but can acquire it through normal hiring inside 4 to 6 months, meaning talent that is available, hireable, and likely to retain.
  • 2: The org does not have the talent, hiring would take 9+ months at substantial premium, but rented talent (agency, fractional CTO, specialized consultancy) is available and credible.
  • 1: The org does not have the talent, cannot hire it on any reasonable timeline, and credible rented talent is not available either.

Capabilities scoring 5 or 4 are build-feasible. Capabilities scoring 3 are hire-feasible (permanent). Capabilities scoring 2 are hire-feasible (rented). Capabilities scoring 1 are buy-mandatory regardless of what the other three questions say; the org cannot operate what it cannot staff.

The pattern that misleads teams: scoring talent fit optimistically. A team with one senior engineer fluent in vector search is not a 5; that engineer leaving leaves the capability orphaned. Score with retention risk and depth in mind, not nameplate count. The full breakdown of how to evaluate hiring AI talent is in the questions to ask AI developers piece.
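The Q4 mapping described above is small enough to write down directly. This is a minimal sketch; the function name and the returned labels are illustrative, while the score-to-outcome mapping follows the text.

```python
def talent_feasibility(q4: int) -> str:
    """Map a Q4 (talent fit) score to the sourcing options it leaves open."""
    if q4 not in range(1, 6):
        raise ValueError("Q4 score must be an integer from 1 to 5")
    if q4 >= 4:
        return "build-feasible"
    if q4 == 3:
        return "hire-feasible (permanent)"
    if q4 == 2:
        return "hire-feasible (rented)"
    # A score of 1 overrides the other three questions.
    return "buy-mandatory"


# One senior engineer with no backup is a 4, not a 5, but still build-feasible.
print(talent_feasibility(4))
print(talent_feasibility(1))
```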

The worked checklist

Score each capability on the four questions, then look up the verb in the table. The default verb is the most common resolution; the alternates are when specific scores combine atypically.

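| Q1: Moat   | Q2: Integration | Q3: Velocity  | Q4: Talent             | Default verb                     |
|------------|-----------------|---------------|------------------------|----------------------------------|
| 5 (high)   | 4-5 (deep)      | 1-3 (low-mid) | 4-5 (have)             | Build                            |
| 4-5 (high) | 4-5 (deep)      | 1-3 (low-mid) | 2-3 (need)             | Hire to build                    |
| 4-5 (high) | 4-5 (deep)      | 4-5 (high)    | any                    | Build, but accept rebuild cycles |
| 3 (mid)    | 3 (mid)         | 3 (mid)       | 3-5 (have or hireable) | Build                            |
| 3 (mid)    | 1-2 (shallow)   | 3-5 (high)    | any                    | Buy                              |
| 1-2 (low)  | 1-3 (any)       | any           | any                    | Buy                              |
| any        | any             | any           | 1 (cannot acquire)     | Buy (talent override)            |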

The single most useful pattern: the talent score is an override. A capability that scores 5/5/3/1 (high moat, deep integration, moderate velocity, no talent) cannot be built no matter how strategic it looks; it must be bought, with the recognition that the buy is a strategic compromise the org should be working to undo.

The second most useful pattern: high decision velocity (Q3 = 4-5) plus high moat (Q1 = 4-5) is the build-but-rebuild case. Foundation models, before they fully commoditized, occupied this corner. Agent orchestration occupies it now. The right answer is build, but with the explicit expectation that the build will be substantially rewritten every 12 to 18 months. Budgets and architecture should reflect that.
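The checklist can be read as an ordered rule list. The sketch below is one possible encoding, not a definitive implementation: the names `RULES` and `default_verb` are illustrative, the ranges are copied from the table as printed, and checking the talent override first is an assumption based on the statement that a Q4 of 1 wins regardless of the other scores. Score combinations the table does not cover return `None`, which corresponds to the decisions described earlier as worth a meeting.

```python
from typing import Optional

ANY = (1, 5)

# Each rule: inclusive (low, high) ranges for Q1..Q4, then the default verb.
RULES = [
    (ANY,    ANY,    ANY,    (1, 1), "Buy (talent override)"),  # checked first (assumption)
    ((5, 5), (4, 5), (1, 3), (4, 5), "Build"),
    ((4, 5), (4, 5), (1, 3), (2, 3), "Hire to build"),
    ((4, 5), (4, 5), (4, 5), ANY,    "Build, but accept rebuild cycles"),
    ((3, 3), (3, 3), (3, 3), (3, 5), "Build"),
    ((3, 3), (1, 2), (3, 5), ANY,    "Buy"),
    ((1, 2), (1, 3), ANY,    ANY,    "Buy"),
]


def default_verb(q1: int, q2: int, q3: int, q4: int) -> Optional[str]:
    """Return the table's default verb, or None if no row matches."""
    scores = (q1, q2, q3, q4)
    if any(s not in range(1, 6) for s in scores):
        raise ValueError("each score must be an integer from 1 to 5")
    for *ranges, verb in RULES:
        if all(lo <= s <= hi for s, (lo, hi) in zip(scores, ranges)):
            return verb
    return None


print(default_verb(5, 5, 3, 1))  # Buy (talent override)
print(default_verb(5, 5, 4, 3))  # Build, but accept rebuild cycles
```

Keeping the rules as data rather than nested conditionals makes the ledger reviewable: a disputed verb can be traced to a single row.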

Three worked examples

The framework is easier to internalize against three concrete capabilities scored end-to-end.

Example 1: vector index for retrieval. Q1 (moat) = 1; vector storage is a commodity. Q2 (integration) = 2; the index is touched by retrieval and observability, nothing else. Q3 (velocity) = 2; index providers change every 2 to 3 years. Q4 (talent) = 4; the team has one senior engineer who runs the existing self-managed index. Verdict: Q1 dominates. Buy. The talent score does not move the verdict because the buy is the default for low-moat capabilities, and the senior engineer is freed to do moat work.

Example 2: agent orchestration for the org’s specific multi-agent workflow. Q1 (moat) = 5; the orchestration is the product. Q2 (integration) = 5; orchestration touches every other capability. Q3 (velocity) = 4; orchestration patterns change every 6 to 12 months. Q4 (talent) = 3; the org has one senior engineer fluent in this and is hiring two more inside 4 months. Verdict: build, with the expectation that the build will be substantially rewritten every 12 to 18 months. The talent score requires the new hires to land; if they don’t, the verdict shifts to “hire to build” through an agency partner.

Example 3: eval suite for the workload. Q1 (moat) = 4; the eval suite is a strong differentiator because it tells the org whether the AI is working. Q2 (integration) = 3; the suite touches agent, retrieval, and observability at boundaries. Q3 (velocity) = 3; eval-set design changes annually. Q4 (talent) = 2; the org has nobody on staff fluent in eval engineering and cannot hire one for 9+ months. Verdict: hire to build. Engage a credible specialist agency to build the eval suite, with explicit transfer-of-ownership terms so the suite ends up in the org’s repo. The decision matrix at the AI development agency vs in-house team analysis covers what those terms should look like in practice. Per the matrix’s fifth principle, eval is build-or-hire, rarely buy.

Frequently asked questions

What if I cannot score a capability confidently on one of the four questions?

That is the signal that more research is needed before deciding. The confidence calibration matters: scoring a capability without confidence produces a verb that does not survive the first quarter. If the moat-density question is unclear, run a small competitive analysis. If the integration depth is unclear, draw the architecture diagram. If the velocity is unclear, look at the last 18 months of vendor releases. If the talent fit is unclear, talk to two recruiters and three candidate consultancies. The cost of doing the research is small; the cost of skipping it is months of mis-sourced work.

What about cost? Does it not enter the decision?

Cost is a derived attribute of the verb, not a separate input. The verb determines the cost structure: build is engineering capacity, buy is contract spend, hire is talent budget. Cost considerations apply at the budgeting layer, not the sourcing layer. Per the Pillar 2 economics manifesto, AI economics are organized around evaluation cost rather than feature cost; the build/buy/hire decision is upstream of that organization.

What about regulatory or compliance constraints?

Those are overrides that can flip a verb regardless of the four-question scores. A capability that handles regulated data and cannot be sent to an external vendor flips to build (or self-hosted buy) regardless of the moat-density score. The four questions are the default; overrides apply on top. Most organizations have one or two override constraints, not twenty.

How often should the four questions be re-asked per capability?

Quarterly, per the matrix’s seventh principle. Most capabilities will keep the same score from quarter to quarter, and the re-ask takes two minutes per capability for those. Capabilities whose score has changed (typically because Q3 (velocity) shifted, Q4 (talent) shifted, or a vendor matured the buy option) get the full 10-to-15-minute review.
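A quarterly pass can be sketched as a comparison of two score snapshots. The helper names below are illustrative, the starting scores are the three worked examples, and the one changed score is hypothetical. Treating a capability that is new to the ledger as a full review is also an assumption.

```python
Scores = tuple[int, int, int, int]  # (Q1 moat, Q2 integration, Q3 velocity, Q4 talent)


def split_reviews(previous: dict[str, Scores], current: dict[str, Scores]) -> tuple[list[str], list[str]]:
    """Return (quick, full): unchanged capabilities vs. changed or new ones."""
    full = sorted(name for name, scores in current.items() if previous.get(name) != scores)
    quick = sorted(name for name in current if name not in full)
    return quick, full


def minutes_needed(quick: list[str], full: list[str]) -> tuple[int, int]:
    """Two minutes per quick re-ask, 10 to 15 minutes per full review."""
    base = 2 * len(quick)
    return base + 10 * len(full), base + 15 * len(full)


last_quarter = {
    "vector index": (1, 2, 2, 4),
    "agent orchestration": (5, 5, 4, 3),
    "eval suite": (4, 3, 3, 2),
}
# Hypothetical change: the orchestration hires landed, so Q4 moves from 3 to 4.
this_quarter = {**last_quarter, "agent orchestration": (5, 5, 4, 4)}

quick, full = split_reviews(last_quarter, this_quarter)
print(full)                         # ['agent orchestration']
print(minutes_needed(quick, full))  # (14, 19)
```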

What is the most common scoring error in practice?

Scoring Q1 (moat density) too high. Teams default to seeing every capability as strategic because they have invested attention in it. The discipline is to imagine a competitor shipping the same capability tomorrow and ask whether anything in the org’s customer wins changes. If the honest answer is “not much,” the score is 2 or 3, not 5. We have seen organizations that score every capability at 4 or 5 and consequently build everything; their AI roadmap then under-performs because capacity is spread across capabilities that were not moat.

Do the four questions apply to startups as well as enterprises?

Yes, with weighting differences. Startups should be more aggressive on Q1 (treat fewer things as moat, since the moat at a startup is product-market-fit not infrastructure depth) and more sensitive on Q4 (a startup’s talent constraint is the most binding constraint it has). Enterprises should be more sensitive on Q2 (deep integration is hard to undo at enterprise scale) and Q3 (procurement cycles cap how fast they can swap a buy decision). The framework is the same; the relative weights differ.

Can a capability legitimately score 5 on all four dimensions?

Rarely, and when it does, the answer is “build with maximum strategic seriousness.” High moat, deep integration, high velocity, and full talent depth describe a capability that is the heart of the AI product and that the org is uniquely positioned to operate. Foundation models in 2018-2020 were in this corner for a handful of labs. Most organizations do not have any capability in this corner; the few that do should treat them as their flagship engineering investments.

What is the relationship between this framework and the AI capability ladder?

The four questions are the per-capability decision process. The AI capability ladder is the pre-resolved output for a set of common capabilities: the capabilities that almost always score the same way and therefore have a default verb across most organizations. The ladder is the shortcut; the four questions are the work behind the shortcut, used when a capability is non-default or when the shortcut produces an answer that does not feel right.

How does this framework handle hybrid arrangements?

Hybrid arrangements (build the moat layer, buy the rails layer, hire the senior architect) are the default per the matrix’s eighth principle. The framework handles them by being applied per sub-capability rather than per top-level capability. Retrieval is not a single line in the ledger; it is a stack of vector index (Q-scores favor buy), embedding model (favor buy), chunking (favor build), reranker (depends), and retrieval-aware eval (favor build or hire). Each sub-capability gets its own four-question pass. The composition is the answer.
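As a rough illustration of "the composition is the answer," the retrieval stack can be held as one ledger entry per sub-capability and then grouped by verb. The dictionary and function names are illustrative; the leanings are the ones stated above, not the output of a scoring pass.

```python
from collections import defaultdict

# Sub-capability -> leaning, as stated in the text for the retrieval stack.
RETRIEVAL_STACK = {
    "vector index": "buy",
    "embedding model": "buy",
    "chunking": "build",
    "reranker": "depends",
    "retrieval-aware eval": "build or hire",
}


def compose(stack: dict[str, str]) -> dict[str, list[str]]:
    """Group sub-capabilities by verb; the grouping is the hybrid arrangement."""
    by_verb: dict[str, list[str]] = defaultdict(list)
    for sub_capability, verb in stack.items():
        by_verb[verb].append(sub_capability)
    return dict(by_verb)


for verb, subs in compose(RETRIEVAL_STACK).items():
    print(f"{verb}: {', '.join(subs)}")
```

No single verb describes "retrieval" here, which is why the ledger needs a row per sub-capability.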

What’s the cost of running this exercise across a 40-capability ledger?

Roughly one full day for a senior architecture group, executed once at the start and re-executed quarterly. The output is a capability ledger with a verb on each row plus the four scores supporting it. The cost is small relative to the cost of getting the verbs wrong, which compounds across every quarter the wrong verbs persist.

Key takeaways

  • Four questions resolve the verb on most AI capabilities: moat density, integration depth, decision velocity, and talent fit.
  • Score each on a five-row rubric, then look up the verb in the worked checklist; most decisions resolve in under 15 minutes per capability.
  • Talent fit is an override: capabilities the org cannot staff cannot be built, regardless of strategic importance.
  • Cost is a derived attribute of the verb, not a separate input; regulatory constraints are overrides applied on top of the four-question result.
  • Re-run the four questions quarterly; most capabilities keep the same score, but the ones that change are the ones worth catching early.

The framework’s value is not that it is novel; most of the underlying ideas are conventional. The value is that it is short enough to run on a 40-capability ledger and disciplined enough to prevent the implicit defaulting that produces most sourcing errors. Organizations that run the four questions deliberately end up with capability ledgers that are defensible, reviewable, and updatable. Organizations that do not end up with capability ledgers that are an artifact of historical accident, rationalized after the fact.

Last Updated: Jun 14, 2026

Arthur Wandzel
