The hiring funnel an AI agency runs in 2026 looks almost nothing like the FAANG-imitation loop most ops leaders expect. No leetcode rounds. No homogeneous “behavioral panel.” No three-week silence between stages. Reverse-engineer the funnel of any forward-deployed AI shop that is shipping eval-gated production code, and the same shape repeats: candidates come from private referrals, OSS contribution graphs, and conference stages, not Indeed; the screen is a live agent debugging session, not a static algorithm puzzle; the loop tests architecture and post-mortem reading, not whiteboard trivia; and the offer is senior-only with a transparent comp band and an agent-tooling stipend on top. This piece dismantles the funnel stage by stage, names the artifacts at each step, and contrasts it with the FAANG-imitation pipeline that most agencies still copy and most candidates still tolerate.
The reason the funnel has bifurcated is straightforward. The work itself changed. An AI agency in 2026 is staffing a forward-deployed unit that ships eval-gated PRs into client repos in week one, the engagement shape laid out in the AI agency manifesto for 2026. The signal you need to detect is not “can they pass an algorithm round” but “can they debug a production agent at 2am, write the post-mortem at 9am, and ship the eval case that prevents the regression by 5pm.” That signal is unreachable through the FAANG loop, and perfectly visible if the funnel is built around the work.
What follows is the actual shape of that funnel, drawn from hiring patterns I have observed at SFAI Labs and across roughly twenty peer agencies in the last 18 months.
Decision Scope
This article is an editorial decision framework, not legal, financial, security, or accounting advice. Treat numeric examples as illustrative planning heuristics unless a source is cited, then validate the assumptions against your own contracts, data, controls, and budget model before acting.
Sourcing: where senior AI engineers come from
Job boards are dead for senior AI hiring. They rarely worked particularly well, but in 2026 they actively select against the candidates a forward-deployed agency wants. The senior engineer who has shipped a production agent, kept it healthy through three model migrations, and written a post-mortem the team still references does not refresh LinkedIn at 11pm hoping for cold outbound. Three sourcing channels do the actual work, and a serious agency runs all three in parallel.
Private referrals from operators, not recruiters. The single highest-yielding channel: warm intros from engineers who have shipped with you, from clients who have watched a candidate work, and from a small handful of operator-investors who run technical due diligence as their day job. The ratio that should make a hiring manager smile is roughly one offer per four referrals, versus one per fifty cold inbound. The reason is selection: the referrer’s own reputation is in the loop, which front-loads quality control. Agencies that refuse to staff a dedicated “talent partner” channel (three or four operators they trust) and instead rely on a recruiter cold-emailing GitHub are running 2019 mechanics in 2026.
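The yield gap is easier to feel as arithmetic. The sketch below uses only the two illustrative ratios quoted above (one offer per four referrals, one per fifty cold inbound); the function name and the target of five offers are assumptions made for the example, not figures from any agency's pipeline.

```python
# Illustrative yields from the text, expressed as candidates per expected offer.
CANDIDATES_PER_OFFER = {
    "referral": 4,       # roughly one offer per four warm referrals
    "cold_inbound": 50,  # roughly one offer per fifty cold inbound applicants
}


def candidates_needed(target_offers: int, channel: str) -> int:
    """Candidates to run through a channel to expect `target_offers` offers."""
    return target_offers * CANDIDATES_PER_OFFER[channel]


if __name__ == "__main__":
    # Hypothetical target: five offers in a quarter.
    for channel in CANDIDATES_PER_OFFER:
        print(f"{channel}: {candidates_needed(5, channel)} candidates")
```

At these ratios, the same five offers cost 20 referral conversations or 250 cold applicants to process, which is the selection argument stated in screening hours.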
OSS contribution graphs. Not “looks at your stars.” Looks at: who has shipped non-trivial PRs to LangGraph, LiteLLM, DSPy, instructor, vllm, sglang, the major eval frameworks, and the dozen smaller libraries the production stack depends on. Contribution depth and durability, not project popularity. A candidate who has merged five thoughtful PRs to LiteLLM over 18 months, including one that fixed a routing bug under provider degradation, is a stronger signal than someone with a 12K-star agent demo on their personal repo. The operative metric: does the maintainer of a library you ship in production know this person’s name. If yes, the candidate is already inside the funnel.
Conference and meetup stages. Not the keynote at the 5,000-person summit; the technical talk at the 80-person AI Engineer World’s Fair workshop, the working group at the Anthropic builder day, the back-room meetup where someone walks through a real eval suite. People who put themselves on a small stage, with slides full of code rather than vibes, pre-filter for the kind of operator a forward-deployed agency wants. The hiring manager’s job at these events is to listen, take notes, and book a thirty-minute follow-up.
What is missing from this list: Indeed, generic LinkedIn job posts, recruiter-driven cold outbound at scale, and “open call” application pages. Those exist as a public signal but are rarely the channel that produces the hire. An agency whose pipeline is dominated by inbound from a public posting is selecting from the bottom 80% of the talent distribution and paying recruiter fees for the privilege.
Screening: live agent debugging, not algorithm puzzles
The screen exists to answer one question in 60 minutes: can this person operate an agent in a state of production-relevant ambiguity? Three artifacts do that work, and the FAANG-imitation phone screen (“reverse a linked list, walk me through your project”) does none of them.
The live agent debugging session. A 45-minute screen in which the candidate is given a real broken agent and a real eval suite, and asked to find the failure mode. The agent is instrumented with traces; the eval suite has 30 examples and is currently failing 11. The job is not to fix it; it is to characterize the failure. Is the regression in retrieval, prompt, routing, model selection, or the tool boundary? They have access to logs, traces, and the prompt registry. The interviewer watches them think rather than grading the solution. What you are looking for is the speed and rigor of the failure-mode hypothesis loop. Strong candidates form a hypothesis in five minutes, test it in ten, and have confirmed or eliminated three sources of error within the half hour.
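What "characterize the failure" can look like in practice is a tally of failing cases by suspected stage. The following is a minimal sketch, not any agency's actual harness: the `EvalResult` type, the stage labels, and the sample data (30 cases, 11 failing, matching the numbers above) are all hypothetical and exist only to illustrate the hypothesis loop.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# The five candidate sources of regression named in the text.
STAGES = ("retrieval", "prompt", "routing", "model_selection", "tool_boundary")


@dataclass(frozen=True)
class EvalResult:
    case_id: str
    passed: bool
    # Hypothesis the candidate attaches after reading traces; None = not yet attributed.
    suspected_stage: Optional[str] = None


def characterize(results: list[EvalResult]) -> dict:
    """Summarize failures by suspected stage without attempting a fix."""
    failures = [r for r in results if not r.passed]
    by_stage = Counter(r.suspected_stage or "unattributed" for r in failures)
    return {"total": len(results), "failed": len(failures), "by_stage": dict(by_stage)}


# Hypothetical session data: 19 passing cases, 11 failing.
SAMPLE = (
    [EvalResult(f"ok-{i}", True) for i in range(19)]
    + [EvalResult(f"ret-{i}", False, "retrieval") for i in range(7)]
    + [EvalResult(f"tool-{i}", False, "tool_boundary") for i in range(3)]
    + [EvalResult("unknown-0", False)]
)

if __name__ == "__main__":
    print(characterize(SAMPLE))
```

A summary shaped like this lets the interviewer see which hypotheses were tested and which failures remain unexplained, which is the thing being graded.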
The eval-write take-home. Capped at two hours. The candidate is given a problem statement and the existing eval suite and asked to write five new ground-truth eval cases that meaningfully tighten the bar, including failure modes the suite currently does not catch. Graded for: production-realistic inputs, explicit pass/fail criteria, measurable thresholds, and coverage tightening in a place that matters. A candidate who returns five cases that mirror the existing ones is a no. A candidate who returns three cases plus a one-page memo explaining which two failure modes are not testable in the current harness and what infrastructure change would make them testable is an immediate yes.
The prompt registry review. A 30-minute session in which the candidate walks through their own prompts, from a real production system, with version history, eval deltas across versions, and rollback decisions. If the candidate cannot produce this artifact, the screen is over. Agencies that hire well treat “you have a versioned prompt registry with eval-tagged commits” the way they treated “you have a public GitHub” in 2018: the bare-minimum demonstration that real work has shipped.
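For readers who have not written one, an eval case with "explicit pass/fail criteria" and a "measurable threshold" might be shaped like the sketch below. Everything here is an assumption for illustration: the `EvalCase` fields, the term-matching scorer, and the refund-policy example are hypothetical, and a real harness would typically use richer scorers.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input_text: str                 # production-realistic input
    failure_mode: str               # what this case is meant to catch
    scorer: Callable[[str], float]  # maps agent output to a score in [0, 1]
    threshold: float                # measurable pass bar

    def passes(self, agent_output: str) -> bool:
        """Explicit pass/fail: the score must meet or exceed the threshold."""
        return self.scorer(agent_output) >= self.threshold


def required_terms_score(required: list[str]) -> Callable[[str], float]:
    """Build a scorer returning the fraction of required terms present."""
    def score(output: str) -> float:
        lowered = output.lower()
        hits = sum(1 for term in required if term.lower() in lowered)
        return hits / len(required)
    return score


# Hypothetical case: the agent must state both conditions of a refund policy.
REFUND_CASE = EvalCase(
    case_id="refund-policy-001",
    input_text="Can I return an item I bought last week?",
    failure_mode="answer omits a required policy condition",
    scorer=required_terms_score(["30 days", "receipt"]),
    threshold=1.0,
)

if __name__ == "__main__":
    print(REFUND_CASE.passes("Refunds are accepted within 30 days with a receipt."))
```

The point of the structure is that the failure mode, the scorer, and the threshold are written down, so a reviewer can argue with each one separately.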
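As a rough picture of what such a registry holds, the sketch below models version history, eval deltas, and a rollback decision. It is a hypothetical data structure, not a description of any particular tool: the `PromptVersion` fields, the commit strings, and the pass counts are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PromptVersion:
    version: str
    commit: str        # eval-tagged commit identifier (hypothetical values below)
    cases_passed: int  # eval cases passed at this version
    cases_total: int


def eval_deltas(history: list[PromptVersion]) -> list[tuple[str, int]]:
    """Change in passed cases for each version relative to its predecessor."""
    return [
        (cur.version, cur.cases_passed - prev.cases_passed)
        for prev, cur in zip(history, history[1:])
    ]


def rollback_target(history: list[PromptVersion]) -> Optional[PromptVersion]:
    """Best earlier version if the latest one scores below it, else None."""
    latest = history[-1]
    best_prior = max(history[:-1], key=lambda v: v.cases_passed, default=None)
    if best_prior is not None and best_prior.cases_passed > latest.cases_passed:
        return best_prior
    return None


# Hypothetical history, oldest first, against a 30-case suite.
HISTORY = [
    PromptVersion("v1", "commit-a", 24, 30),
    PromptVersion("v2", "commit-b", 27, 30),
    PromptVersion("v3", "commit-c", 21, 30),
]

if __name__ == "__main__":
    print(eval_deltas(HISTORY))
    print(rollback_target(HISTORY))
```

A candidate who can narrate their own history in these terms (what changed, what the suite said, why they rolled back) has the artifact the screen asks for.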
What is missing from the screen: leetcode, hash-table-from-scratch implementation rounds, “what is your favorite project” narrative interviews, and any form of trivia about transformer internals that does not connect to a shipping decision. None of those predict performance on the work the agency does.
The interview loop: architecture, post-mortems, and paired coding with an agent
Candidates who clear the screen reach the loop. Four stations, in this order, with an offer decision made within five business days of the loop start. Speed is itself a signal; senior candidates have other offers, and a two-week gap between loop and offer is how you lose them. For the broader pattern of red flags in AI hiring processes on either side of the table, the same compression discipline applies.
Station one: architecture review. A 60-minute session in which the candidate presents an architecture they have shipped, in detail, with the trade-offs named. Model selection, routing layer, retrieval strategy, tool-call boundary, eval gate, observability stack, cost ceiling, fallback behavior under provider degradation. They draw the diagram on a whiteboard or screen-share. The interviewer’s questions probe the trade-offs the candidate did not name: “why did you not put a caching layer here,” “what happens when Anthropic returns 529 for an hour,” “what would change if your average context budget doubled.” Strong candidates have already considered and discarded the alternatives the interviewer raises; weaker candidates have only considered the path they took.
Station two: post-mortem reading. A 45-minute session in which the candidate is handed a real (anonymized) post-mortem from a production incident, usually a four-page document covering the failure timeline, the immediate cause, the contributing factors, and the remediation. They read it cold for ten minutes, then walk the interviewer through what they would have done differently as the on-call engineer, what they would have done differently as the engineer who shipped the change, and what eval case or monitoring alert they would have added six months earlier to prevent the incident entirely. This station predicts on-call performance better than any other artifact in the loop. Candidates who have lived through their own production incidents read the document with the appropriate density of attention; candidates who have only read about incidents in blog posts skim it.
Station three: paired coding with an agent. A 90-minute working session in which the candidate is given a real ticket from a real engagement and asked to ship it, with Claude Code or a similar agent at their side. The interviewer sits in. The signal is not whether the candidate can drive the agent fluently; it is the discipline. Do they scope the change before opening files? Write the test or eval case before the implementation? Verify the agent’s output against the eval suite rather than against vibes? Commit small? Push back on the agent when it generates plausible-but-wrong code? The strongest candidates use the agent as a force multiplier without ever surrendering judgment. The weakest either refuse to use it (cultural mismatch) or accept its output uncritically (capability mismatch).
Station four: client-shape conversation. A 45-minute conversation with a hiring partner, not to pitch but to test the candidate’s ability to talk to the kind of operator who buys AI engagements. Senior product owners, VPs of engineering, sometimes CEOs. Forward-deployed work means the engineer is in the room with the client; candidates who collapse, hedge, or default to consultant-speak are a culture risk regardless of technical loop strength. The bar is calm specificity: name the trade-off, the cost, the timeline, and answer the question being asked rather than the one you wished had been asked.
The contrast with the FAANG loop is sharpest here. The legacy loop is five 45-minute panels of algorithm rounds, whiteboard system design, and a behavioral panel grading for cultural conformity. None of it tests the actual work. A candidate who passes the four-station loop above cannot fail to ship in week one; definitionally, that is what each station is testing.
The offer: senior-only baseline, transparent band, agent-tooling stipend
The offer stage is where most agencies still leak the candidates they spent the most effort recruiting. The 2026 pattern that retains senior AI engineers, and that distinguishes a serious agency from a body shop, has four mechanical components.
Senior-only baseline. The agency does not hire mid-level engineers for client-facing work. Period. The work is too compressed, the client is too senior, and the eval discipline assumes a level of independent judgment mid-level engineers are still building. Staffing junior or mid engineers on forward-deployed engagements either burns the engineer (set up to fail) or the client (paying senior rates for ramping work). Junior engineers exist in the org (on internal tooling, the eval-harness platform team, the agent-ops team), but they do not own a client relationship until they have shipped on internal work for six to twelve months.
Transparent comp band. Published, in dollars, before the loop starts. A range like “$240K–$310K base, plus 15–25% performance bonus tied to eval-delta on shipped engagements” rather than “competitive.” Candidates at this level know the market, are talking to four other agencies, and notice immediately when one of them refuses to publish a number. The comp band has a ceiling, the ceiling is enforced, and exceptions are explicitly justified at the offer-committee level. The agency that wins the closing rate war is not the one with the highest band; it is the one whose candidates feel the band was honest. For a deeper view of how comp bands compare against external AI talent options, the hiring outsourced AI team guide covers the trade-offs.
Agent-tooling stipend. A monthly line item funding the engineer’s personal agent-tooling spend: Claude Pro/Max, Cursor or Claude Code, ChatGPT Plus, a private model API budget, and niche tools the engineer wants to evaluate. Typical range $300–$600 per month. This sounds like a perk; it is a cultural commitment. An agency that wants its engineers shipping at agent-augmented velocity but does not pay for the tools is signaling that it does not understand the work.
Equity or revenue share with a real path. Either a meaningful equity slice with a reasonable strike, or a revenue share tied to engagements the engineer leads. Token grants (0.05% with a four-year cliff and no clear exit) are read as bad faith. Senior candidates have run the math; the agency’s job is to not insult them.
The signal of a strong offer is that the candidate accepts within 72 hours, and that the agency would be comfortable showing the terms publicly. If either condition fails, the offer is bent. Agencies that compete by extracting the lowest comp the candidate will accept have already mispriced the relationship and will spend the next eighteen months losing those engineers. The trust signals an AI agency exhibits at offer stage map almost one-to-one onto how it runs client engagements.
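Candidates will run the numbers on the band and the stipend, so it helps to run them first. The sketch below uses only the example figures quoted in this section ($240K–$310K base, 15–25% bonus, $300–$600 per month stipend); the variable and function names are assumptions, and it treats the bonus as fully paid out at each end of its range, which a real plan may not do.

```python
# Example figures quoted in the text (illustrative, not a market survey).
BASE_RANGE = (240_000, 310_000)    # annual base, USD
BONUS_PCT_RANGE = (15, 25)         # performance bonus as a percent of base
STIPEND_MONTHLY_RANGE = (300, 600) # agent-tooling stipend, USD per month


def cash_comp(base: int, bonus_pct: int) -> int:
    """Base plus bonus, assuming the bonus pays out in full."""
    return base + base * bonus_pct // 100


def annual_stipend(monthly: int) -> int:
    """Yearly tooling spend funded by the stipend."""
    return monthly * 12


LOW = cash_comp(BASE_RANGE[0], BONUS_PCT_RANGE[0])
HIGH = cash_comp(BASE_RANGE[1], BONUS_PCT_RANGE[1])
STIPEND = tuple(annual_stipend(m) for m in STIPEND_MONTHLY_RANGE)

if __name__ == "__main__":
    print(f"cash comp: ${LOW:,} to ${HIGH:,}")
    print(f"tooling stipend per year: ${STIPEND[0]:,} to ${STIPEND[1]:,}")
```

On these figures the stipend is $3,600 to $7,200 a year against cash compensation of $276,000 to $387,500, which is why its weight in the offer is cultural rather than financial.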
Why the FAANG-imitation funnel still exists
Given how much better the funnel above performs, the obvious question is why most agencies still run the FAANG loop. Three reasons, all bad.
The first is legibility. The FAANG loop is what hiring committees expect, and rejecting a candidate after a leetcode round is easier to defend than rejecting one after a paired-coding session. Defensible decisions are not the same as good decisions, but in some cultures the former is what gets rewarded.
The second is hiring-manager incapacity. Designing the four-station loop requires interviewers who can themselves debug a production agent and read a post-mortem with rigor. Many agencies do not have those interviewers, so they fall back on a loop any engineer can run.
The third is candidate flow. The FAANG loop scales: a hundred loops a quarter at low marginal cost. The four-station loop is bandwidth-intensive and caps pipeline at maybe twenty senior candidates per quarter. Agencies that have not absorbed that capping the pipeline is a feature, not a bug, keep optimizing throughput at the expense of signal.
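The bandwidth cost can be put in interviewer-hours. This back-of-envelope sketch takes the four station lengths from the loop described earlier (60, 45, 90, and 45 minutes) and assumes one interviewer per station with no prep or debrief time; both assumptions understate the real cost, and the names are hypothetical.

```python
# Station lengths in minutes, as described in the four-station loop.
STATION_MINUTES = {
    "architecture_review": 60,
    "post_mortem_reading": 45,
    "paired_coding": 90,
    "client_shape_conversation": 45,
}

LOOP_MINUTES = sum(STATION_MINUTES.values())


def interviewer_hours(candidates: int, interviewers_per_station: int = 1) -> float:
    """Interview time per quarter, excluding prep, debrief, and the screen."""
    return candidates * LOOP_MINUTES * interviewers_per_station / 60


if __name__ == "__main__":
    print(f"one loop: {LOOP_MINUTES} minutes")
    print(f"20 candidates: {interviewer_hours(20)} interviewer-hours")
```

Twenty loops at 240 minutes each is 80 hours of senior interviewer time per quarter before any preparation, which is the ceiling the text describes.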
The agencies that have absorbed it run the funnel reverse-engineered above, and they systematically win offers from the candidates who matter. That is the entire game.
Arthur Wandzel is the founder of SFAI Labs, a forward-deployed AI development agency in San Francisco. He has run more than 80 senior engineer hiring loops in the last 24 months and advises peer agencies on funnel design.
Frequently Asked Questions
Where do AI agencies source senior engineers in 2026?
Three channels do nearly all the work: private referrals from operators (engineers who have shipped with you, clients who have watched a candidate work, operator-investors running technical due diligence), OSS contribution graphs on the libraries that ship in production (LangGraph, LiteLLM, DSPy, instructor, vllm, sglang, eval frameworks), and small technical conference and meetup stages. Job boards, generic LinkedIn posts, and recruiter cold outbound exist as a public signal but rarely produce the actual hire.
What does the AI engineer screen look like if it is not leetcode?
Three artifacts: a 45-minute live agent debugging session against a real broken agent and a real failing eval suite (the candidate characterizes the failure mode rather than fixing it), a strictly two-hour eval-write take-home producing five new ground-truth eval cases that tighten coverage, and a 30-minute review of the candidate’s own versioned prompt registry from production work they have shipped. Algorithm rounds and trivia about transformer internals are not part of the screen because they do not predict performance on the work.
What is the four-station interview loop for an AI agency?
Station one is a 60-minute architecture review of a system the candidate shipped, covering model selection, routing, retrieval, tool boundary, eval gate, observability, and fallback behavior under provider degradation. Station two is a 45-minute post-mortem reading of a real anonymized incident document. Station three is a 90-minute paired coding session with Claude Code or a similar agent, working a real ticket. Station four is a 45-minute client-shape conversation with a hiring partner. The loop closes within five business days of starting.
Why does the FAANG-imitation hiring loop fail for AI agency hiring?
The FAANG loop tests algorithm rounds, whiteboard system design, and a behavioral panel. None of those test the work an AI agency does: debugging a production agent, reading a post-mortem with rigor, shipping eval-gated work in week one, and operating in a client’s repo. A candidate can pass the FAANG loop and be unable to do that work. A candidate who passes the four-station forward-deployed loop cannot fail to ship in week one because each station is testing the actual work.
What does an honest senior AI engineer offer look like in 2026?
Four mechanical components: a senior-only baseline (no mid-level engineers on client-facing work), a transparent comp band published in dollars before the loop starts (e.g., $240K-$310K base plus 15-25% performance bonus tied to eval delta), an agent-tooling stipend of $300-$600 per month covering Claude Pro or Max, Cursor or Claude Code, ChatGPT Plus, and a personal model API budget, and meaningful equity or revenue share with a real path. The candidate accepts within 72 hours and the agency would be comfortable showing the terms publicly.
Why is a senior-only baseline a structural requirement, not snobbery?
Forward-deployed AI engagements ship eval-gated PRs in week one, against a senior client, with compressed timelines and high independent-judgment requirements. Mid-level engineers, by definition, are still building that judgment. Staffing them on client-facing work either burns the engineer (who is set up to fail) or burns the client (who pays senior rates for ramping work). Junior engineers exist on internal tooling, eval-harness platform work, and agent-ops, and rotate to client work after six to twelve months of internal shipping.
What is an agent-tooling stipend and why does it filter cultural mismatches?
An agent-tooling stipend is a monthly line item, typically $300-$600, that funds the engineer’s personal AI tooling spend: Claude Pro or Max, Cursor or Claude Code, ChatGPT Plus, a private model API budget, and niche tools. It is a cultural commitment rather than a perk. An agency that wants its engineers shipping at agent-augmented velocity but does not pay for the tools signals to the candidate that it does not understand the work. Candidates who roll their eyes at the stipend often roll their eyes at agent-augmented development itself.
How should an AI agency screen for the prompt registry?
A 30-minute session in which the candidate walks through their own prompts from a real production system, with version history, eval deltas across versions, and rollback decisions. If the candidate cannot produce this artifact, the screen ends there. Treat ‘you have a versioned prompt registry with eval-tagged commits’ the way you treated ‘you have a public GitHub’ in 2018: bare minimum demonstration that the candidate has shipped real production work, rather than only built demos.
Why do most AI agencies still run the FAANG-imitation loop despite worse results?
Three reasons. Legibility: rejecting a candidate after a leetcode round is easier to defend on a hiring committee than rejecting one after a paired-coding session, even though defensible decisions are not the same as good ones. Hiring-manager incapacity: designing a four-station loop requires interviewers who can themselves debug a production agent and read a post-mortem with rigor, and many agencies do not have those interviewers. Throughput: the FAANG loop scales to a hundred candidates a quarter, while the four-station loop caps pipeline at twenty senior candidates per quarter, which agencies that have not absorbed pipeline-capping as a feature treat as a problem.
What ratio of referrals to offers should an AI agency hiring manager expect?
Roughly one offer per four warm referrals from operators the agency trusts, versus approximately one offer per fifty cold inbound applicants from a public job posting. The reason is selection: the referrer’s reputation is in the loop, which front-loads quality control before the candidate ever reaches a recruiter. An agency whose pipeline is dominated by inbound from public postings is selecting from the bottom of the talent distribution and paying recruiter fees for the privilege.