Refolk
June 30, 2026 · 9 min read

June's HN Thread Split "AI Engineer" in Two. One JD Won't Hire Either

The June 2026 HN "Who is Hiring" thread exposed a JD defect: AI Engineer is two jobs. Here's how to source both as parallel searches.

If you posted an "AI Engineer" role this month and got a hundred applicants, none of them the right person, the June 2026 Hacker News "Who is Hiring" thread (item 48357725) just explained why. A top commenter and a half-dozen staffing write-ups in the same two-week window called it out: the title now covers two structurally different jobs, and one job description cannot recruit both.

The pattern that surfaced is simple. One archetype evaluates and fine-tunes models. The other ships reliable agent pipelines in production. A fintech in the thread admitted it hired what it thought was one person and ended up hiring two, because nobody priced the infra and reliability work until the first prod incident. The fix is not a better JD. The fix is two parallel searches with two distinct skill signals and, often, two different title queries entirely.

What the June HN thread actually showed

The "Who is Hiring?" thread for June 2026 reads differently from a year ago. Postings cluster around agentic dev environments, LLM product surfaces, evaluation work, prompt and retrieval roles, and what one trends recap called "making AI survivable inside real business workflows." That last phrase is the tell. Survivable means production. Production means the bottleneck is reliability, not modeling.

Companies hiring in the thread split cleanly along that line. Carma (YC W24) wants someone who can run production agent workflows over messy telematics data. Pango is recruiting for its "Agentic Operating System for e-commerce logistics," which is a systems job dressed in agent vocabulary. Hatchet, also in the thread, builds the job orchestration layer those agents run on. EggAI and Ashby are hiring senior generalists who can build AI features end to end, which in practice means the pipeline side most days and the model side rarely.

Meanwhile, the actual eval and fine-tune work shows up under different titles. ML Engineer. Applied Scientist. Research Engineer. Applied AI Engineer at Anthropic, where Forward Deployed Engineer and Applied AI Engineer are used interchangeably. The "AI Engineer" title itself has drifted toward the pipeline archetype, and sourcing through it alone will leave the model side empty.

The two archetypes, named

Archetype 1: the eval and fine-tune engineer

This person owns the model. They run the eval-train-deploy loop. Fireworks made the case in April 2026 that this loop is the discipline: teams that ship "collapse the eval-train-deploy cycle from days of fragmented tooling into a tight loop measured in hours and stopped treating fine-tuning as a single technique." SFT, RFT, DPO, and GRPO all run for the same product. LoRA when the data is thin. DeepEval, eval-protocol, and MLflow agent tracing for the measurement side. The Anthropic evals guide is required reading.

Background lineage: PyTorch, training loops, often a research or applied ML resume. They came up doing model work before LLMs ate the category. If your problem is "the model is wrong in a domain-specific way and we have proprietary data," this is the hire.

Archetype 2: the agent pipeline engineer

This person owns the system around the model. LangChain or LangGraph for orchestration. Temporal or Hatchet for durability. OpenTelemetry for observability. Claude Code or Cursor inside the dev loop. They think in retries, idempotency, tool-call schemas, and graceful degradation when an LLM hallucinates a function name. The reasoning layer plans and chooses tools. The action layer calls APIs and loops until the task completes or fails. The agent pipeline engineer is responsible for what happens when that loop misbehaves at 3 a.m.
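A minimal sketch of what "retries, idempotency, tool-call schemas, and graceful degradation" can look like in code. Everything here is illustrative: the tool registry, the `lookup_shipment` tool, the `TransientError` class, and the `call_tool` helper are hypothetical names, not part of LangChain, LangGraph, Temporal, or Hatchet, and a real deployment would likely lean on the orchestrator's own retry and durability features instead.

```python
import time
from typing import Any, Callable, Dict


class TransientError(Exception):
    """Stand-in for a timeout or 5xx from a downstream API (hypothetical)."""


def lookup_shipment(shipment_id: str) -> dict:
    # Hypothetical tool; a real one would call an external API.
    return {"shipment_id": shipment_id, "status": "in_transit"}


# Registry of tools the model is allowed to call.
TOOLS: Dict[str, Callable[..., Any]] = {"lookup_shipment": lookup_shipment}

# Results stored by idempotency key so a replayed call is not executed twice.
_results: Dict[str, Any] = {}


def call_tool(name: str, args: dict, idempotency_key: str,
              max_attempts: int = 3, base_delay: float = 0.0) -> dict:
    # Graceful degradation: a hallucinated tool name becomes a structured
    # error the agent loop can react to, not an unhandled exception.
    if name not in TOOLS:
        return {"ok": False, "error": f"unknown tool: {name}",
                "known_tools": sorted(TOOLS)}

    # Idempotency: return the stored result for a key we have already served.
    if idempotency_key in _results:
        return {"ok": True, "result": _results[idempotency_key], "replayed": True}

    for attempt in range(1, max_attempts + 1):
        try:
            result = TOOLS[name](**args)
        except TransientError as exc:
            if attempt == max_attempts:
                return {"ok": False,
                        "error": f"gave up after {attempt} attempts: {exc}"}
            # Exponential backoff between retries.
            time.sleep(base_delay * 2 ** (attempt - 1))
        except TypeError as exc:
            # Wrong or missing arguments will not fix themselves on retry.
            # (A TypeError raised inside the tool is treated the same way here.)
            return {"ok": False, "error": f"bad arguments: {exc}"}
        else:
            _results[idempotency_key] = result
            return {"ok": True, "result": result, "replayed": False}

    return {"ok": False, "error": "max_attempts must be at least 1"}
```

Calling `call_tool("lookup_shipmnt", {"shipment_id": "A1"}, "k0")` with the misspelled name returns an error payload listing the known tools, which the reasoning layer can use to correct itself. The design choice worth probing in an interview is the split between errors that are retried and errors that are returned immediately.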

Background lineage: backend, distributed systems, SRE, sometimes platform engineering, with LangChain or LlamaIndex layered on in the last 18 months. If your problem is "the prototype works in a notebook but falls over the second a customer touches it," this is the hire.

Why one JD attracts a hundred wrong applicants

The staffing-firm post that circulated two weeks ago put it bluntly: the posting "borrows half its bullets from a 2020 data science template and the other half from a LinkedIn thread about agents, attracts a hundred applicants and the right person for none of them, because the posting is describing two different jobs at once."

The Irvine fintech case in the same post is the canonical failure. A company hired what they called an ML engineer to build an "AI assistant for our compliance team." Strong PyTorch background. The actual job was a retrieval system over ten years of regulatory filings. The hire kept reaching for a fine-tune when sharper retrieval would have shipped in a fraction of the time. The HN fintech anecdote is the same failure mode one stage later: they wrote one JD, hired the model-side person, hit the first prod incident, and had to go back out for a pipeline engineer.

729% increase
Forward Deployed Engineer postings, April 2025 to April 2026
Indeed data via Business Insider: 643 to 5,330 in twelve months. The adjacent role is absorbing the pipeline archetype faster than "AI Engineer" titles can keep up.

The LangChain 2026 State of Agent Engineering survey (n=1,300+) sharpens the diagnosis. 57% of respondents have agents in production. 89% have implemented observability. Only 52% have implemented evals. That 37-point gap between instrumenting and measuring is the seam between the two archetypes. The pipeline engineer instruments. The eval engineer measures. Writing one JD that demands both buries the actual seniority signal, because almost nobody is genuinely senior in both at the same time.

The same survey reports that 57% of organizations are not fine-tuning at all. They run base models with prompt engineering and RAG. So listing "fine-tune LLMs" as a requirement on a generic AI Engineer JD does two harmful things at once: it filters out qualified pipeline engineers who have never needed to fine-tune, and it puts you in competition for a small expensive pool you may not actually use.
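The headline numbers in this section are simple arithmetic on the quoted figures, and it is worth being able to reproduce them before repeating them to a hiring manager. A quick check, using only the numbers cited above (the helper name `pct_increase` is just for illustration):

```python
def pct_increase(old: float, new: float) -> float:
    """Percentage increase from old to new."""
    return (new - old) / old * 100


# Forward Deployed Engineer postings, April 2025 to April 2026.
fde_growth = pct_increase(643, 5330)   # about 728.9, reported as 729%
fde_multiple = 5330 / 643              # about 8.3x as many postings

# Observability adoption minus evals adoption, in percentage points.
gap_points = 89 - 52                   # 37

print(round(fde_growth), round(fde_multiple, 1), gap_points)
```

Note that a 729% increase means roughly 8.3 times as many postings, not 7.3 times, and that the 37 is a gap in percentage points, not a 37% relative difference.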

Source them as two parallel searches

Stop trying to write the unified posting. Run two funnels. The skill signals are different, the title queries are different, and the rejection reasons should be different too.

For the agent pipeline engineer

Search "AI Engineer" plus LangChain, LangGraph, Temporal, or Hatchet. Look for OpenTelemetry, observability tooling, and prior backend or SRE work. The Forward Deployed Engineer skill mix from a 1,000-posting analysis is a useful template: Python (66%), AI agents (35%), TypeScript (35%), AWS (32%), LLMs (31%). Note TypeScript and cloud, not PyTorch. Title cousins to include: Forward Deployed Engineer, Applied AI Engineer, Solutions Engineer at AI-native companies (Sierra, Harvey, Decagon, Cognition).

A natural-language query beats a Boolean here, because the signal is "has shipped agents to prod and owned them through an incident," which is hard to express as a keyword string. This is the friction we built Refolk to remove: you describe the person in plain English ("AI engineers in SF or NYC who have shipped LangGraph or Temporal agents to production at a Series A or later startup, with backend or SRE background") and get a ranked shortlist across GitHub, LinkedIn, and the open web.

For the eval and fine-tune engineer

Do not search "AI Engineer." The title is mostly capturing the pipeline archetype. Search ML Engineer, Applied Scientist, Research Engineer, or Applied AI Engineer. Skill signals: PyTorch, SFT, DPO, RFT, GRPO, LoRA, eval-protocol, DeepEval, MLflow, the Anthropic evals guide vocabulary, the Fireworks Training API. Background lineage matters more than current title, because the people who can do this work were often hired into a generic "AI Engineer" slot a year ago and are mislabeled in their own profiles.

A useful asymmetry to know before you start: when we filter current "AI Engineer" titles in the US by LangChain, roughly 425 profiles return, concentrated in SF Bay, Boston, Chicago, Atlanta, and NYC, at employers like PostHog, Tellius, ZoomInfo, and Verizon. When we filter the same title by fine-tuning skills, the set is nearly empty. The eval and fine-tune profile is at least an order of magnitude rarer through that title filter. You have to search a different title bucket entirely to find them, which is precisely what sourcing AI engineers as one role gets wrong.
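If you do stay with Boolean strings, the two searches above can be sketched as two separate queries built from the title cousins and skill signals named in this section. This is a rough illustration only: the `boolean_query` helper is hypothetical, and operator syntax varies by sourcing platform, so treat the output as a starting point to adapt.

```python
from typing import Iterable


def boolean_query(titles: Iterable[str], skills: Iterable[str]) -> str:
    """Build a generic '(title OR ...) AND (skill OR ...)' search string."""
    def group(terms: Iterable[str]) -> str:
        return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"
    return f"{group(titles)} AND {group(skills)}"


# Funnel 1: agent pipeline engineer.
PIPELINE_QUERY = boolean_query(
    titles=["AI Engineer", "Forward Deployed Engineer",
            "Applied AI Engineer", "Solutions Engineer"],
    skills=["LangChain", "LangGraph", "Temporal", "Hatchet", "OpenTelemetry"],
)

# Funnel 2: eval and fine-tune engineer. Note the absence of "AI Engineer".
EVAL_QUERY = boolean_query(
    titles=["ML Engineer", "Applied Scientist",
            "Research Engineer", "Applied AI Engineer"],
    skills=["PyTorch", "SFT", "DPO", "RFT", "GRPO", "LoRA",
            "DeepEval", "MLflow"],
)

print(PIPELINE_QUERY)
print(EVAL_QUERY)
```

Neither string captures "has owned an agent through a production incident," which is the limit of keyword matching that the plain-English approach is meant to address.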

89% vs 52%
Observability adoption vs evals adoption in agent teams (share of respondents)
LangChain State of Agent Engineering 2026, n=1,300+. The 37-point gap is the seam between the pipeline hire and the eval hire.

The breaking point Anthropic named

Anthropic's June 2026 agent-evals guide describes the moment the unified JD becomes untenable. "Once an agent is in production and starts scaling, building without evals starts to break down. The breaking point comes when users report the agent feels worse after changes and the team is flying blind." A separate practitioner write-up from May 2026 adds that 74% of organizations rely primarily on human evaluation for AI agents, because automated metrics have systematic blind spots.

Translate that to hiring. Your pipeline engineer can get you to production. They cannot tell you whether the agent is getting worse, because that requires an eval harness, a labeling protocol, and someone who treats regressions as a measurement problem rather than an incident. The eval engineer can build that harness but will not enjoy owning the on-call rotation for a LangGraph deployment. These are not the same job, and the people who do them well rarely overlap in temperament, let alone skill.
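To make "treats regressions as a measurement problem" concrete, here is a toy sketch of the smallest possible eval harness: a fixed labeled set, a pass rate, and a comparison between two agent versions. The eval cases, the two stand-in agents, and the function names are all invented for illustration; a real harness would sit on tooling such as DeepEval or MLflow and a proper labeling protocol.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical labeled eval set: (user message, expected routing label).
EVAL_SET: List[Tuple[str, str]] = [
    ("refund order 123", "refund"),
    ("where is my package", "track"),
    ("cancel my subscription", "cancel"),
    ("talk to a human", "escalate"),
]


def make_agent(rules: Dict[str, str]) -> Callable[[str], str]:
    """Build a stand-in 'agent' that routes on keywords (illustration only)."""
    def agent(message: str) -> str:
        for keyword, label in rules.items():
            if keyword in message:
                return label
        return "unknown"
    return agent


agent_v1 = make_agent({"refund": "refund", "package": "track",
                       "cancel": "cancel", "human": "escalate"})
# v2 simulates a "small tweak" that silently drops the escalation path.
agent_v2 = make_agent({"refund": "refund", "package": "track",
                       "cancel": "cancel"})


def pass_rate(agent: Callable[[str], str],
              cases: List[Tuple[str, str]]) -> float:
    passed = sum(1 for message, expected in cases if agent(message) == expected)
    return passed / len(cases)


def check_regression(baseline, candidate, cases, tolerance: float = 0.0) -> dict:
    base = pass_rate(baseline, cases)
    cand = pass_rate(candidate, cases)
    return {"baseline": base, "candidate": cand,
            "regressed": cand < base - tolerance}


report = check_regression(agent_v1, agent_v2, EVAL_SET)
print(report)
```

With this harness, "the agent feels worse" turns into a number that drops between versions before a user reports it. Exact-match scoring is the simplest possible grader and is assumed here only to keep the example short.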

What to do this week

Rewrite the JD as two postings. Two titles. Two skill sections. Two interview loops. The pipeline loop should include a system-design round on retries, idempotency, and tool-call failure modes. The eval loop should include a take-home or live exercise on building a small eval set and arguing for or against a fine-tune given the data.

Then source them as two searches. If you are using a Boolean string, expect to write two. If you want to skip the Boolean and ask in plain English, that is exactly what Refolk is for: one query for the agent pipeline engineer, a separate query for the eval and fine-tune engineer, each tuned to the title cousins and skill signals above. The point is not the tool. The point is to stop pretending one funnel can produce both.

The HN commenter's fintech got there the expensive way: hire one, hit the incident, hire the second. The cheap way is to admit in the JD that you are hiring two people, or that you are hiring one and naming which half you are deferring. Either is honest. The unified posting is not.

FAQ

Is "AI Engineer" still a useful title to post under?

Yes, for the agent pipeline archetype. The title now reliably attracts engineers with LangChain, LangGraph, and production-agent experience, and the candidate pool understands the pipeline framing. Use it for that role and write a second posting under ML Engineer, Applied Scientist, or Applied AI Engineer for the eval and fine-tune role. Posting both under "AI Engineer" is what produces the hundred-wrong-applicants problem.

Do I really need to hire two people, or can a senior generalist cover both?

Senior generalists exist but are rare and expensive, and most of them are already at Anthropic, OpenAI, Sierra, Harvey, Decagon, or Cognition. For a Series A or B company, the realistic plan is to hire one archetype first based on which failure mode will hit you first. If you are pre-production, hire the pipeline engineer. If you are post-production and users are reporting that the agent "feels worse," hire the eval engineer immediately. Anthropic calls that the breaking point for a reason.

How do I source the eval and fine-tune engineer if the title doesn't match?

Search adjacent titles (ML Engineer, Applied Scientist, Research Engineer) and filter on skill signals: PyTorch, SFT, DPO, LoRA, DeepEval, MLflow, and exposure to the Fireworks or Anthropic eval tooling. Background lineage (training loops, research code, applied ML at a previous role) matters more than current job title because many of these people are mislabeled in their own profiles. Ask for the search in plain English rather than rebuilding a Boolean every time the title taxonomy shifts.

What's the cleanest signal that a candidate is the pipeline archetype and not the model archetype?

Ask them to describe the last time an agent failed in production and what they changed in the system, not the model. The pipeline engineer will talk about retries, tool-call schemas, observability gaps, and orchestration. The model engineer will reach for prompt changes or a fine-tune. Both answers are valid, but they tell you which role you are actually interviewing for, regardless of what the resume says.
