Refolk
June 16, 2026 · 8 min read

June's HN Thread Split "AI Engineer" in Two. Your JD Hires Neither.

The June 2026 HN hiring thread exposed two AI engineer archetypes. Here is how to split the req, rewrite the JD, and source against both.


The top comment on the June 2026 Hacker News "Who is hiring?" thread did your hiring committee a favor. It pointed out, with receipts from a dozen JDs in the same thread, that "AI engineer" is now two jobs taped together: one person who evals and fine-tunes models, and another who ships reliable agent pipelines in production. A fintech commenter underneath admitted they had budgeted one headcount, hit their first prod incident, and ended up hiring two.

If your open req says "AI engineer" and lists both "design eval rubrics" and "own agent reliability in prod," you are about to repeat that mistake. The candidates who can do both at a senior level effectively do not exist in volume. The market has already split. Your job description has not.

The split is real, and the data backs it

LangChain's State of Agent Engineering 2026 (1,300+ respondents) is the cleanest snapshot. 57% of teams have agents in production. 32% name quality as the top barrier. And there is a yawning gap underneath those headlines: about 89% have implemented observability for their agents, while only 52% have implemented evals.

37 pts
Gap between agent observability adoption (89%) and eval adoption (52%)
Most teams instrumented the pipeline before they ever wrote a rubric. Your JD probably reverses that order.

That 37-point gap is the entire argument. The teams shipping agents wired up traces, retries, and budget controls first, because without those, nothing else matters. Evals came later, and for many teams, are still coming. When you write a JD that leads with "design and own our evaluation framework," you are screening for work that sits downstream of the instrumentation most teams had to build first.

Datadog's State of AI Engineering report makes the same point from the other direction. In March 2026, 2% of all LLM spans returned an error, and rate-limit errors were nearly a third of those, roughly 8.4 million rate-limit errors in a single month of telemetry. The dominant production failure mode is not "the model got the answer wrong." It is provider capacity, retry storms, and backpressure. That is distributed-systems work. The LLM is just the executor.
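
To make "distributed-systems work" concrete, here is a minimal sketch of a circuit breaker, one common way to stop a retry storm from hammering a provider that is already rate-limiting you. The class name, thresholds, and injected clock are illustrative assumptions, not something taken from the Datadog report or from any particular SDK.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures; allows calls again after a cooldown."""

    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value, tune per provider
        self.cooldown_s = cooldown_s                # assumed value, tune per provider
        self.clock = clock                          # injectable so tests need no real waiting
        self.consecutive_failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, calls are allowed again as probes.
        # A failed probe re-opens the circuit via record_failure().
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = self.clock()


# Usage with a fake clock: three rate-limit errors open the circuit.
fake_now = [0.0]
breaker = CircuitBreaker(failure_threshold=3, cooldown_s=30.0, clock=lambda: fake_now[0])
for _ in range(3):
    breaker.record_failure()
blocked = not breaker.allow_request()   # calls are shed while the circuit is open
fake_now[0] = 31.0
probing = breaker.allow_request()       # cooldown elapsed, a probe may go through
```

A candidate who has run agents in production will usually have an opinion on where a breaker like this should live (per provider, per model, per tenant) and on what the agent should do with a half-finished trajectory while it is open.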

Archetype 1: the LLM eval engineer

This is the person most managers picture when they write "AI engineer." They know how to:

  • Design rubrics that other annotators can follow without drifting.
  • Build golden sets, slice them by failure mode, and re-baseline when the model rev ships.
  • Run pairwise comparisons, calibrate LLM-as-judge against human labels, and detect when the judge starts hallucinating agreement.
  • Fine-tune (LoRA, DPO, or full SFT depending on stack) and know when not to.
  • Read a confusion matrix and tell you which 4% of inputs are dragging your aggregate score.
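
As one illustration of what "calibrate LLM-as-judge against human labels" can mean in practice, the sketch below computes Cohen's kappa, a standard chance-corrected agreement statistic. The labels are made up for the example, and kappa is only one of several reasonable calibration checks.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Agreement expected if both raters labeled at random with their own base rates.
    expected = sum(
        (human_counts[label] / n) * (judge_counts[label] / n)
        for label in set(human_counts) | set(judge_counts)
    )
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels on a ten-item golden set. The judge says "pass" nine times.
human_labels = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "pass", "pass", "fail", "pass", "pass", "pass", "pass", "pass"]

raw_agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)
kappa = cohens_kappa(human_labels, judge_labels)
print(f"raw agreement={raw_agreement:.2f}, kappa={kappa:.2f}")  # raw agreement=0.80, kappa=0.41
```

The judge agrees with humans 80% of the time, yet kappa is only about 0.41, because a judge that says "pass" almost every time agrees by default on a mostly-passing set. That gap between raw agreement and chance-corrected agreement is the kind of signal an eval engineer watches for.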

The non-obvious part: the best eval engineers are closer to senior QA leads with domain depth than to ML PhDs. Rubric writing is a writing skill and a domain skill before it is a statistics skill. Engineering managers screening for "ML background" routinely filter out the exact candidates who would have shipped a usable eval harness in six weeks.

If your product is in a regulated or specialized domain (fintech, clinical, legal, ops), the eval engineer is the one who will keep your model from confidently lying to a customer. Roles shaped like those at Distyl AI, Hedgineer, and Tradeify all live or die on this person.

Archetype 2: the agent pipeline engineer

This is the role most JDs underweight, and the one that ate the fintech commenter's quarter. The job is to treat an agent as what it actually is in production: a distributed system where an LLM happens to be the planner and executor. The skill stack:

  • At least one agent framework used in anger. In 2026 that means Anthropic's SDK, OpenAI's Agents SDK, LangGraph, or Vercel's AI SDK. Not a demo. A repo with retries, idempotency keys, and a dead-letter queue.
  • Strict tool contracts. Deterministic state transitions where possible. Tool calls that fail closed.
  • Trace-level observability wired from day one (OpenTelemetry, Langfuse, Arize, Datadog LLM Observability, take your pick).
  • Rate-limit accounting and capacity engineering. Token budgets per request, per tenant, per minute. Circuit breakers when a provider degrades.
  • A working theory of evaluation in CI, even if the eval engineer owns the rubrics.
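
The first two bullets above can be sketched in a few dozen lines. This is a minimal, framework-free illustration: `ToolRunner`, `RateLimited`, and `ToolExhausted` are hypothetical names, not part of any SDK listed here, and the in-memory dictionary stands in for what would be a durable store in a real system.

```python
import random
import time
import uuid
from typing import Any, Callable


class RateLimited(Exception):
    """Hypothetical transient error, e.g. a provider returning HTTP 429."""


class ToolExhausted(Exception):
    """Raised when every attempt failed and the call was dead-lettered."""


class ToolRunner:
    def __init__(self, max_attempts: int = 4, base_delay_s: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep):
        self.max_attempts = max_attempts
        self.base_delay_s = base_delay_s
        self.sleep = sleep                     # injectable so tests do not wait
        self.completed: dict[str, Any] = {}    # idempotency key -> stored result
        self.dead_letters: list[dict] = []     # calls parked for human review

    def run(self, tool: Callable[[dict], Any], payload: dict, idempotency_key: str) -> Any:
        # A replayed step with a known key returns the stored result
        # instead of repeating the side effect.
        if idempotency_key in self.completed:
            return self.completed[idempotency_key]
        last_error = None
        for attempt in range(self.max_attempts):
            try:
                result = tool(payload)
            except RateLimited as exc:
                last_error = exc
                if attempt < self.max_attempts - 1:
                    # Exponential backoff with full jitter between attempts.
                    self.sleep(random.uniform(0, self.base_delay_s * 2 ** attempt))
            else:
                self.completed[idempotency_key] = result
                return result
        # Fail closed: park the call and stop, rather than guessing a result.
        self.dead_letters.append(
            {"key": idempotency_key, "payload": payload, "error": repr(last_error)}
        )
        raise ToolExhausted(idempotency_key)


# Usage: a hypothetical refund tool that is rate-limited twice, then succeeds.
calls = {"n": 0}

def flaky_refund(payload: dict) -> dict:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited("429")
    return {"refunded": payload["amount"]}

runner = ToolRunner(sleep=lambda seconds: None)
key = str(uuid.uuid4())
first = runner.run(flaky_refund, {"amount": 25}, key)
second = runner.run(flaky_refund, {"amount": 25}, key)  # served from the stored result
```

Client-side deduplication like this only covers replays inside one process. If the side effect can succeed while the response is lost, the key also has to be honored by the downstream service, which is a good follow-up question in an interview.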

This person is a backend engineer with cloud chops who has spent the last 18 months in the LLM stack. They are not an "ML person." Screening them with ML interview loops will fail them on signal they do not need and pass them on signal you do not care about.

The agent pipeline engineer is a distributed systems hire who happens to call LLMs. Stop interviewing them like a model researcher.

The screening question that separates them

Robert Ardell at KORE1's AI/ML staffing practice has a line worth stealing: the hire fails when a company screens an agentic engineer like a regular LLM developer and never asks how their last agent behaved at 2 a.m. under load. Put that question in your phone screen. The candidates who light up at it are the ones you want. The ones who pivot to "well, our eval scores were 92%" are the other archetype, which is fine, but it means you are interviewing them for the wrong role.

Why benchmarks are lying to your hiring loop

One piece of academic work is load-bearing here, and every manager writing an AI req should at least skim it. ReliabilityBench (arXiv 2601.06112) found that if a benchmark reports 90% accuracy, you should expect 70 to 80% in production once you account for consistency and faults. A follow-up paper across 23,392 episodes (arXiv 2603.29231) showed that reliability, defined as consistent success across repeated invocations, degrades super-linearly with task complexity, and that pass@1 metrics on short atomic tasks make this completely invisible.

Translated to hiring: a candidate who walks in with a working agent demo has shown you capability, not reliability. Ask for variance across 100 runs. Ask for the meltdown point. Ask what the retry policy was, and what they did when the underlying provider rate-limited them mid-trajectory. If they cannot answer in that vocabulary, they are an Archetype 1 candidate who built a cool demo, not an Archetype 2 hire.
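
Here is a rough sketch of why a single-run pass rate hides the problem. It compares a pooled per-run success rate with the share of tasks that succeed on every repeated run. The task names and outcomes are invented, and the "every run passes" measure is a simplification for illustration, not the exact metric used in either paper.

```python
def per_run_success_rate(runs: dict[str, list[bool]]) -> float:
    """Pooled success rate over all runs, similar in spirit to pass@1."""
    successes = sum(sum(outcomes) for outcomes in runs.values())
    total = sum(len(outcomes) for outcomes in runs.values())
    return successes / total


def consistent_task_rate(runs: dict[str, list[bool]]) -> float:
    """Fraction of tasks that succeeded on every repeated run."""
    consistent = sum(all(outcomes) for outcomes in runs.values())
    return consistent / len(runs)


# Hypothetical outcomes: four tasks, five repeated runs each.
runs = {
    "lookup_balance":     [True, True, True, True, True],
    "refund_order":       [True, True, True, True, False],
    "multi_step_dispute": [True, False, True, True, False],
    "reconcile_ledger":   [True, True, False, True, True],
}

headline = per_run_success_rate(runs)
reliable = consistent_task_rate(runs)
print(f"per-run success={headline:.2f}, tasks passing every run={reliable:.2f}")
# per-run success=0.80, tasks passing every run=0.25
```

On this toy data the agent looks 80% capable, while only one task in four behaves the same way every time. A candidate who has tracked both numbers for a real system is speaking the Archetype 2 vocabulary.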

What this means for your JD and your sourcing

The supply is thin. Our internal index currently shows roughly 4,019 US profiles holding titles like AI Engineer, Applied AI Engineer, LLM Engineer, or Agent Engineer, heavily clustered in the SF Bay Area and NYC. LinkedIn's 2025 Jobs on the Rise report ranks "AI Engineer" the fastest-growing US title for the second year running, with 75,000 roles added between 2023 and 2025. The req volume is enormous. The qualified pool, split correctly into two archetypes, is small.

Comp reflects this. Mid-to-senior agentic AI engineers run $155K to $265K base in 2026, with top performers clearing $400K total comp. Well-defined searches close in 5 to 9 weeks. Vague ones do not close.

Title search is what produces the unicorn-shaped Venn diagram. "AI Engineer" as a LinkedIn filter returns a pile of generative-AI generalists alongside the two archetypes you actually want, with no way to tell them apart from the title alone. This is the exact friction Refolk was built for: you describe the person in plain English ("backend engineer who has wired up production agent retries and written about it") and get a ranked shortlist across GitHub, LinkedIn, and the open web. The eval engineer has a different prompt: "QA lead or applied scientist with domain depth in fintech who has published rubrics or annotation guidelines."

Two prompts. Two shortlists. Two reqs.

Rewriting the JD

A few concrete edits that will pay for themselves in the first interview loop:

  1. Split the req. Two JDs, two loops, two hiring managers if possible. Stop trying to hire one person.
  2. Name the framework. "Experience with LangGraph, OpenAI Agents SDK, Anthropic SDK, or Vercel AI SDK" gives candidates and sourcers something concrete to match on; "experience with LLMs" does not. Well-defined searches are the ones that close in 5 to 9 weeks.
  3. For Archetype 2, lead with reliability vocabulary. Retries, idempotency, backpressure, traces, tool contracts. If the JD reads like a backend role with an LLM section, you are calibrated correctly.
  4. For Archetype 1, lead with domain and rubric writing. Not "PhD preferred." Senior QA leads and domain experts who can write are your best candidates.
  5. Drop "fine-tune" from Archetype 2 entirely. It is not their job. If you keep it, you will reject the right people.

The "forward-deployed" red herring

One last trap. The role is fragmenting publicly into LLMOps Engineers, Evals Engineers, Context Engineers, AI Data Engineers, and Forward-Deployed AI Engineers. Do not conflate the FDE split with the eval/pipeline split. FDE is an operational and contextual distinction (do they sit with the customer?). The HN thread's split is technical (do they own the model or the system?). A forward-deployed AI engineer can be either archetype. Mashing the two axes into one JD is how you end up with a four-skill unicorn req nobody can fill.

The June thread did the diagnostic work for free. Two archetypes. Two JDs. Source against both. The teams that figure this out in Q3 will close their AI hires before the teams still posting one combined req finish their second round of phone screens.

FAQ

Can one person ever do both jobs?

A handful of senior people can, and they are not on the market. If you find one, you are not hiring them at your posted band. For 95% of teams, the realistic path is two hires (or one hire plus a contractor on the side you are weaker on). The fintech commenter on the June HN thread is the median case, not the exception.

What if we only have headcount for one?

Hire the agent pipeline engineer first if you have agents in production or plan to within two quarters. The LangChain data (89% observability adoption vs 52% eval adoption) suggests this is what teams actually do, even if their JDs say otherwise. The eval engineer has more leverage once the pipeline is instrumented and you have real traces to evaluate against. Hiring evals-first into an uninstrumented stack is how rubrics end up measuring noise.

How do we screen Archetype 2 without an ML interview loop?

Use a distributed systems loop with one LLM-aware round. Ask about a real incident from their last agent system. Ask for the retry policy, the rate-limit handling, the tool contract design. Ask the 2 a.m. question. If they cannot draw the trace of a failed agent run on a whiteboard, they have not shipped one in production, regardless of what their resume says.

Where are these candidates actually findable?

GitHub commit history on agent framework repos and adjacent tooling is the highest-signal channel for Archetype 2. Conference talks, internal eng blogs, and rubric writeups are the channel for Archetype 1. LinkedIn title search is the worst channel for both, because the title is doing none of the work. This is the gap Refolk closes: ask in plain English for the specific behavior (shipped retries on LangGraph, wrote a public eval rubric for a regulated domain) and get the people, not the title.
