The AI agency lock-in playbook (and how clients can defuse it)

Lock-in inside an AI agency engagement is rarely a single decision; it is the cumulative residue of six small ones that nobody flagged. Each is defensible in isolation: the agency holds the API keys because day-one provisioning was urgent, the prompt registry sits in their repo because that is where the engineer was working, the eval suite lives behind their CI because the runner was already configured. Twelve months later the founder discovers that none of those decisions can be reversed without a six-figure re-implementation, and the engagement is no longer a relationship; it is a dependency.

This piece is a spoke under the AI agency manifesto. The manifesto names what an AI dev partner should be; this piece is the adversarial map of how partnership quietly becomes capture, and the moves a client can make to reverse it. Lock-in is not a moral failing; it is a commercial gravity well. Every shortcut that benefits short-term velocity tightens the gradient toward the agency’s continued involvement. Defusing it requires deliberate counter-engineering from day one.

The first half of this playbook describes the six mechanisms. The second half is the defusal protocol. None of the defusal moves are exotic, but many are unfamiliar to most procurement teams.

Why lock-in is the default outcome

A 2024-vintage AI engagement is a portfolio of artifacts spread across a dozen tools: Anthropic and OpenAI consoles, Modal volumes, Hugging Face workspaces, W&B projects, Cursor sidebars, Pinecone clusters, GitHub Actions runners. Each tool is a credential surface. Each surface defaults to whoever provisioned it first, and each provisioning decision is made under time pressure during week one, when the client is busy negotiating scope and the agency is busy shipping the demo.

The result is structural. Twelve weeks in, the agency has accumulated a constellation of small ownership claims, none load-bearing in isolation, many load-bearing in combination. The engagement became sticky not by malice but by gradient. Defusing it requires the client to spend operational effort against the gradient: more friction in week one in exchange for less friction at month twelve. Most clients will not.

Mechanism 1: the agency holds the keys

The agency creates the Anthropic, OpenAI, Modal, and Pinecone accounts during week one because the client procurement team has not processed the credit card form. Billing sits on an agency master account with cost passed through. Six months later the production system runs against API keys nobody on the client side has ever seen.

Why it locks in. Rotating an API key is trivial; replacing the system that depends on a key the client cannot rotate is a six-week project: fresh accounts, migrated fine-tuning jobs and embedding indexes, end-to-end retest, reissued secrets.

The failure footprint. The founder terminates and discovers production still flows through agency-owned keys; the agency’s AP sees a $40K monthly invoice and asks, reasonably, when it stops. The founder cannot stop it without a six-week migration, and the senior engineer who could lead it just left.

Mechanism 2: prompt registry kept private

The agency’s senior prompt engineer keeps production prompts in a private repo, a Cursor sidebar, or a Notion page the team has iterated on since week two. Every two weeks the prompts ship to production through a script the engineer maintains. The client sees outputs but not source.

Why it locks in. A prompt for a non-trivial agentic system is a multi-thousand-token construct: tool-use schemas, retrieval templates, response formats, refusal policies, chain-of-thought scaffolding. Reproducing it from outputs is impossible; from memory it takes weeks of senior engineering. The prompt is the system, and if the prompt lives in the agency’s tooling, the system lives there too.

The failure footprint. At handoff the client receives a Sunset Package missing the actual production prompt because the engineer who tuned it has been pulled to a different account, and nobody else can reconstruct the variant that ships at 09:00 PT on Tuesdays.

Mechanism 3: eval suite in the agency repo

The eval suite (datasets, judge prompts, rubric thresholds, regression cases) lives in the agency’s GitHub org. The client gets weekly eval reports as PDFs, never the underlying code. When a regression hits, the agency runs the suite; the client cannot.

Why it locks in. Eval is the only ground truth for whether the system has gotten better or worse. A client who cannot run their own evals is structurally dependent on the agency’s interpretation of every change: they cannot validate a successor’s claims, cannot test a model swap (Claude Opus 4.7 vs the next release), and cannot enforce a quality bar in writing because the bar exists only in the agency’s CI.

The failure footprint. The client brings the system in-house, hires two engineers, and discovers they cannot tell whether their changes regress because the eval set lives in a repo the agency has not transferred. They ship a regression and blame themselves rather than the absence of evals.
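A client-side team can make the "did we regress?" question answerable with a small gate that compares a candidate run's eval scores against a stored baseline. The sketch below is illustrative only: the metric names, the score dictionaries, and the 0.01 tolerance are assumptions, not part of any particular agency's suite.

```python
def regressions(baseline, candidate, tolerance=0.01):
    """Return metrics where the candidate is missing or falls below
    the baseline by more than `tolerance`. Empty dict means no regression."""
    failed = {}
    for metric, base_score in baseline.items():
        new_score = candidate.get(metric)
        if new_score is None or new_score < base_score - tolerance:
            failed[metric] = (base_score, new_score)
    return failed


# Hypothetical scores; in practice these would come from the eval runner.
baseline = {"faithfulness": 0.90, "refusal_accuracy": 0.95}
candidate = {"faithfulness": 0.91, "refusal_accuracy": 0.90}

failed = regressions(baseline, candidate)
for metric, (old, new) in failed.items():
    print(f"REGRESSION {metric}: baseline={old} candidate={new}")
```

A CI job could fail the build whenever the returned dictionary is non-empty, which puts the quality bar in the client's repo rather than in someone's interpretation of a PDF.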

Mechanism 4: model artifacts in the agency cloud

Fine-tuned weights, LoRA adapters, distilled checkpoints, and embedding indexes sit in the agency’s Hugging Face workspace, S3 bucket, or Modal volume. The client has never downloaded a copy; production code references paths into agency-controlled storage.

Why it locks in. Model artifacts are large, expensive to recreate, and often impossible to reproduce exactly; the training run depended on a specific dataset snapshot, seed, and software version that has since drifted. Re-fine-tuning is not “run the script again”; it is a multi-week project to reconstruct lineage.

The failure footprint. The client terminates and asks for weights. The agency delivers a tarball whose lineage is unclear, whose hyperparameters live in a Slack thread, and whose dataset references files that have since been deleted. Technically delivered, practically undeployable. See the hidden Y problem in AI agency contracts for the artifact taxonomy.
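The lineage gap described here (dataset snapshot, seed, software version, hyperparameters) is the kind of thing a small manifest written at training time can close. The following is a minimal sketch under stated assumptions: the function name, the field names, and the package versions shown are hypothetical, and a real run would hash the actual dataset files rather than an in-memory byte string.

```python
import hashlib
import json
import sys


def lineage_manifest(dataset_bytes, seed, hyperparameters, package_versions):
    """Collect the facts needed to re-run a fine-tune into one record."""
    return {
        "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        "seed": seed,
        "hyperparameters": hyperparameters,
        "python": sys.version.split()[0],
        "packages": package_versions,
    }


# Illustrative values only.
manifest = lineage_manifest(
    dataset_bytes=b"example training rows",
    seed=1234,
    hyperparameters={"learning_rate": 2e-5, "epochs": 3},
    package_versions={"example-trainer": "0.0.0"},
)

# Sorted keys keep the committed file stable across runs.
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Committing this file next to the weights means the hyperparameters no longer live in a Slack thread, and a changed dataset shows up as a changed hash.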

Mechanism 5: undocumented architecture

The system has a twelve-component data flow (ingestion, chunking, embedding, retrieval, reranking, prompt assembly, generation, parsing, validation, logging, eval, observability), and the architecture lives in the head of the senior engineer who designed it. There is no diagram and there are no Architecture Decision Records; the runbook is an out-of-date Notion page. When a P1 hits, only that engineer can triage.

Why it locks in. Implicit architecture is agency-owned architecture. A successor cannot bid because they cannot estimate; an in-house team cannot scope because they cannot identify the seams.

The failure footprint. Six months post-termination the client is still calling the agency for $400/hour incident support because nobody else can answer “why does the reranker degrade on queries with technical jargon?” The retainer was 100 hours; it has burned 280.

Mechanism 6: junior client engineers shadowing

The client assigns two junior engineers to “shadow” the agency: sit in standup, watch PRs, attend retros. Agency leads ship the work. Junior engineers absorb context but never own a component, lead an architecture decision, or push a production change without agency approval.

Why it locks in. Shadowing produces familiarity, not capability. A junior who has watched the agency for six months can describe the system but cannot operate it. The illusion of knowledge transfer is worse than its absence: the founder believes they have hedged when they have not.

The failure footprint. The agency rolls off; the juniors attempt their first production change; it breaks an undocumented invariant; rollback requires the agency engineer no longer on payroll. See why senior AI engineers should refuse junior-led agency engagements; the inverse asymmetry applies on both sides.

Defusal 1: own the keys from day one

Before week one, provision client-owned accounts on every platform the engagement will touch: Anthropic, OpenAI, Modal, Pinecone, Hugging Face, W&B, GitHub, Datadog. Issue API keys to the agency under named service accounts with rotation logged in the client’s password manager. The agency operates against client-owned credentials from day one, never the reverse.

Why it works. Termination becomes a one-step credential rotation rather than a six-week migration. The client controls the cost surface, sees every invoice, and can revoke access in an afternoon. The agency cannot accumulate ownership claims because they have no platform admin standing.

Operational cost. Three to five hours of week-one provisioning, two to four hours of monthly key rotation. Trivial against the $200K-plus cost of a forced migration.
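The ownership and rotation rules above can be checked mechanically against a credential inventory. This is a hedged sketch, not a prescribed tool: the `Credential` fields, the "client"/"agency" labels, the service account names, and the 30-day window (chosen to match monthly rotation) are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Credential:
    platform: str          # e.g. "Anthropic", "Pinecone"
    account_owner: str     # who holds platform admin: "client" or "agency"
    service_account: str   # named identity the key was issued to
    last_rotated: date


def audit(credentials, today, max_age_days=30):
    """Return findings; an empty list means the inventory is clean."""
    findings = []
    for cred in credentials:
        if cred.account_owner != "client":
            findings.append(
                f"{cred.platform}: account owned by {cred.account_owner}, not client"
            )
        if today - cred.last_rotated > timedelta(days=max_age_days):
            findings.append(
                f"{cred.platform}: key for {cred.service_account} is overdue for rotation"
            )
    return findings


# Hypothetical inventory for illustration.
inventory = [
    Credential("Anthropic", "client", "svc-agency-dev", date(2026, 1, 15)),
    Credential("Pinecone", "agency", "svc-agency-dev", date(2025, 11, 1)),
]
findings = audit(inventory, today=date(2026, 1, 31))
for line in findings:
    print(line)
```

Running a check like this monthly, alongside the rotation itself, surfaces any account that has drifted back to agency ownership.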

Defusal 2: prompt and eval registries in the client repo

Day one: the agency commits its first prompt, system message, retrieval template, eval dataset, and judge rubric to the client-owned GitHub repo. Every subsequent change ships through a PR against that repo. No production-required artifact lives outside client source control.

Why it works. The Sunset Package becomes redundant. The client can redeploy from their own repo at any moment. Successor agencies inherit a real artifact set, not a markdown summary. The eval suite runs in the client’s CI on every PR, including the agency’s.

Operational cost. Two days of week-one repo setup (CI, secrets, branch protection, eval runner). The agency may push back because their internal tooling is more efficient; the answer is that internal efficiency belongs to the agency, production artifacts belong to the client. This is the operational forcing function behind the AI agency exit clause every founder should negotiate; a 14-day Sunset Package is undeliverable without it.
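The rule that no production-required artifact lives outside client source control can be enforced with a small CI check. The sketch below assumes a hand-maintained list of required artifact paths and a list of tracked files (for example, the output of `git ls-files`); the paths and the function name are hypothetical.

```python
def untracked_artifacts(required, tracked):
    """Return (path, reason) pairs for required artifacts that are
    not files tracked in the client repo."""
    tracked_set = set(tracked)
    problems = []
    for path in required:
        if "://" in path:
            # A URL or bucket URI points at storage outside the repo.
            problems.append((path, "external reference"))
        elif path not in tracked_set:
            problems.append((path, "not in client repo"))
    return problems


# Hypothetical manifest of what production needs.
required = [
    "prompts/system.md",
    "prompts/retrieval_template.md",
    "evals/judge_rubric.yaml",
    "s3://agency-bucket/evals/dataset.jsonl",
]
# Hypothetical output of `git ls-files` in the client repo.
tracked = ["prompts/system.md", "evals/judge_rubric.yaml", "README.md"]

problems = untracked_artifacts(required, tracked)
for path, reason in problems:
    print(f"{path}: {reason}")
```

Failing the PR when the list is non-empty keeps the registry honest without relying on anyone remembering to ask.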

Defusal 3: ADRs co-authored at every architecture decision

Every non-trivial architecture decision (chunking strategy, retrieval topology, model routing logic, eval rubric, fallback behavior) produces an Architecture Decision Record committed to the client repo. The ADR names the decision, alternatives considered, rationale, and expected failure modes. The agency lead drafts; a client engineer reviews and merges.

Why it works. Documented decisions outlast the people who made them. A successor reads the why, not just the what. Implicit architecture becomes explicit architecture: auditable, transferable, operable by someone other than the original engineer.

Operational cost. Roughly 30 minutes per ADR, 12-30 ADRs across a six-month engagement. The discipline pays for itself the first time a senior engineer leaves and the next person reads the design rationale instead of reverse-engineering the code.
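A reviewer can confirm that an ADR names the four elements listed above before merging it. The sketch below assumes ADRs are Markdown files whose sections are `#`-style headings with exactly these titles; that file convention and the sample ADR text are assumptions, not something the protocol dictates.

```python
REQUIRED_SECTIONS = (
    "Decision",
    "Alternatives considered",
    "Rationale",
    "Expected failure modes",
)


def missing_sections(adr_text):
    """Return the required section titles absent from an ADR's headings."""
    headings = {
        line.lstrip("#").strip().lower()
        for line in adr_text.splitlines()
        if line.startswith("#")
    }
    return [s for s in REQUIRED_SECTIONS if s.lower() not in headings]


# Hypothetical draft ADR, missing two sections.
draft = """# ADR 7: Chunking strategy
## Decision
Split on headings, then by a fixed token budget.
## Rationale
Heading-aligned chunks retrieved better in our evals.
"""

gaps = missing_sections(draft)
print("missing:", gaps)
```

The check is deliberately shallow: it confirms the sections exist, and leaves judging their quality to the client engineer who reviews and merges.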

Defusal 4: weekly knowledge-transfer cadence

Reserve a recurring 90-minute slot every Friday for explicit knowledge transfer: not a status meeting, not a demo, but an instructional session. The agency engineer presents one component end-to-end (data flow, failure modes, runbook entries) to the client engineering team. Recording is mandatory and indexed.

Why it works. Six months of weekly 90-minute KT sessions produce 24-26 sessions, roughly 36-39 hours of recorded, indexed instruction covering every component. New hires onboard against the recordings; offboarding has the curriculum already built. Client capability grows linearly instead of cliff-edging at termination.

Operational cost. 90 minutes per week of senior agency time. The alternative, no KT until the offboarding window, produces a worse offboarding and a worse retainer relationship. See the AI agency knowledge transfer playbook for the full curriculum design.

Defusal 5: client engineer pair-programming with the agency lead

Assign one senior client engineer (not a junior shadow) to pair-program with the agency lead at least three days per week. The client engineer ships PRs as primary author against agency-owned components, with the agency lead reviewing. Authorship rotates monthly.

Why it works. Capability is built through writing, not watching. An engineer who has shipped twelve PRs against the prompt registry can operate it; one who has watched twelve PRs ship cannot. By month three, a meaningful fraction of production changes originate client-side; by month six, the agency lead reviews rather than drives.

Operational cost. One senior client engineer at 60% allocation, the largest line item in the protocol and the one most clients refuse to fund. The math: an engineer at $250K fully loaded costs $37.5K per quarter at 60% ($250K × 0.60 ÷ 4); a forced re-implementation costs $200K-plus and 90 days of risk. The defusal is roughly five times cheaper per quarter and arrives without the trauma.
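The comparison can be checked with a few lines of arithmetic. The sketch below uses the inputs quoted above ($250K fully loaded, 60% allocation, a $200K re-implementation floor); the function name is illustrative, and the allocation is passed as a whole-number percentage to keep the arithmetic exact.

```python
def quarterly_cost(fully_loaded_annual, allocation_pct):
    """Cost of one engineer for one quarter at a given percent allocation."""
    return fully_loaded_annual * allocation_pct / 100 / 4


pairing_per_quarter = quarterly_cost(250_000, 60)
reimplementation_floor = 200_000
ratio = reimplementation_floor / pairing_per_quarter

print(f"pairing cost per quarter: ${pairing_per_quarter:,.0f}")
print(f"re-implementation vs one quarter of pairing: {ratio:.1f}x")
```

Changing the inputs to a client's own salary bands and allocation shows how sensitive the ratio is; the comparison is per quarter, so a longer engagement narrows the gap.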

Defusal 6: exit clause at every milestone

Each milestone acceptance includes an explicit exit gate: a 30-minute review during which the client confirms, yes or no, that they could terminate at this milestone and operate the delivered system independently. If the answer is no, the milestone is not accepted; the agency completes the missing artifacts before the milestone closes.

Why it works. Lock-in cannot accumulate if every milestone forces the system into a transferable state. The gate produces continuous discipline rather than offboarding-window panic. The client always knows their current marginal cost of termination. For the contractual scaffolding, see the AI agency exit clause every founder should negotiate and the field guide to evaluating an AI agency in under 90 minutes for pre-engagement diligence.

Operational cost. 30 minutes per milestone plus whatever artifact remediation the gate surfaces. Remediation is usually small if defusals 1-5 are in place; it becomes large only when the engagement has drifted, which is exactly when the gate is most valuable.
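The yes/no gate can be recorded as a checklist so that a "no" always names what is missing. The sketch below is one possible shape: the five checklist items are assumptions derived from defusals 1-5, not a canonical list, and the type and function names are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ExitGate:
    keys_client_owned: bool
    prompts_and_evals_in_client_repo: bool
    adrs_current: bool
    kt_sessions_recorded: bool
    client_engineer_can_deploy: bool


def review(gate):
    """Return (accepted, missing): accepted only if every item is true."""
    missing = [f.name for f in fields(gate) if not getattr(gate, f.name)]
    return (not missing, missing)


# Hypothetical milestone where ADRs have fallen behind.
milestone_3 = ExitGate(
    keys_client_owned=True,
    prompts_and_evals_in_client_repo=True,
    adrs_current=False,
    kt_sessions_recorded=True,
    client_engineer_can_deploy=True,
)
accepted, missing = review(milestone_3)
print("accepted" if accepted else f"not accepted, missing: {missing}")
```

The list of missing items doubles as the remediation scope the agency completes before the milestone closes.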

Frequently asked questions

Will the agency push back on these defusals?

Some will. The pushback is diagnostic. An agency that resists keys-from-day-one or registries-in-client-repo is signaling that its commercial model depends on lock-in. Resistance to the defusals is a stronger reason to walk than any pricing or scope objection.

How much do these defusals slow down the engagement?

The first two weeks are slower, typically 5-10 days of net delay against fast-way provisioning. Weeks three onward are not slower in any measurable way; the friction is front-loaded and amortizes across the engagement.

What if the client doesn’t have a senior engineer to pair with the agency lead?

Then the engagement should be smaller, or the client should hire one before signing. A six-month AI engagement without a senior client engineer cannot absorb the work, regardless of contract language. The hire is part of the engagement budget, not separate from it.

Are these defusals different for a $50K engagement versus a $500K one?

The principles are identical; the rigor scales. A $50K engagement might have 4-6 ADRs, weekly 30-minute KT sessions instead of 90, and a single milestone gate. The $500K engagement has the full protocol. The mistake is dropping it entirely on a small engagement; the mechanisms operate at every price point.

What about agencies that use proprietary internal frameworks?

Fine when licensed forward to the client at termination; the agency lists the framework in a Pre-Existing IP schedule and grants a perpetual royalty-free license. The defusal does not require the agency to give up its IP; it requires the agency to ensure the client can continue running it.

How do these defusals interact with the EU AI Act?

Tightly. For high-risk systems, Article 26 places obligations on the deployer: using the system in accordance with its instructions for use, assigning human oversight, monitoring its operation, and retaining logs. A client locked into agency-controlled artifacts cannot meet those obligations because artifacts and operational logs live elsewhere. Defusals 1, 2, and 6 are the operational scaffolding that makes deployer obligations satisfiable.

Can these defusals be retrofitted into an existing engagement?

Partially. Keys, registries, and ADRs can be migrated in 4-8 weeks. Pair-programming and KT cadence can start immediately. Most resistant to retrofit are model artifacts and undocumented architecture, both of which require deliberate excavation. Retrofitting is more expensive than installing day-one but cheaper than terminating without it.

What is the single highest-leverage defusal if I can only afford one?

Defusal 2: prompt and eval registries in the client repo. It is the forcing function that makes every other artifact transfer cheap. Without it, the rest of the playbook is rhetoric.

Key takeaways

  • Lock-in is the default outcome of a 2024-vintage AI engagement. Six small ownership claims accumulate into structural dependency unless deliberately counter-engineered.
  • The six mechanisms: agency-held keys, private prompt registry, eval suite in agency repo, model artifacts in agency cloud, undocumented architecture, junior-engineer shadowing.
  • The six defusals: client-owned keys from day one, registries in the client repo, co-authored ADRs, weekly KT cadence, senior pair-programming, exit gates at every milestone.
  • Resistance to defusals 1, 2, or 6 is a near-perfect signal of a lock-in commercial model; walk away before signing.
  • Highest-leverage single defusal: prompt and eval registries in the client repo. Every other artifact transfer is cheap once that one is in place.

Arthur Wandzel is the founder of SFAI Labs, a forward-deployed AI development agency in San Francisco. He has run lock-in audits on more than 30 AI engagements, on both sides of the table.

Last Updated: May 29, 2026