HN's June Thread Split "AI Engineer" Into Two Roles. Your JD Hires Neither.
The June 2026 HN thread exposed two non-overlapping AI Engineer archetypes. Here are the Boolean strings, GitHub signals, and tools to source each.
The top-voted comment in the June 1, 2026 HN "Who is hiring?" thread said the quiet part out loud: "AI Engineer" is now two jobs glued into one job description, and almost nobody does both. A fintech commenter on the same thread admitted they ended up hiring two people for what they had posted as one role. If your JD still reads like a single hire, you are sourcing a unicorn that the market has already split into two distinct candidate pools with different tools, different GitHub graphs, and different on-call expectations.
The split the June thread made undeniable
The HN comment that triggered the discussion drew a clean line. On one side: people who can design evals and (sometimes) fine-tune models. On the other side: people who can ship agent pipelines that survive production. The commenter noted that very few candidates can do both, and the replies piled on with the same observation from infra leads, founding engineers, and fractional CTOs.
This is not a vibes argument. LangChain's "State of Agent Engineering" survey of more than 1,300 practitioners maps the same split into adoption numbers: 89% adoption for observability versus 52% for evals. Observability is now table stakes. Evals are not. Fine-tuning is rarer still.
Read those numbers as a sourcing signal, not a maturity score. Observability is the reliability archetype's home turf. Evals are the IC archetype's home turf. The gap between the two numbers is the gap between the two candidate pools. Writing one JD that demands both is writing a JD that fits neither.
Why the title alone tells you nothing
Refolk's index currently shows roughly 3,700 US profiles with "AI Engineer" exactly in their current title, concentrated in SF and NYC with a long tail at companies like Distyl AI, Playbooks AI, and Develop Health. That number proves the title exists. It does not tell you whether the person on the other end of the cold email writes LLM-as-judge graders in Promptfoo or ships OpenTelemetry semconv PRs to OpenLLMetry. Those two people share a title and share almost nothing else.
This is the core problem with sourcing AI engineers in mid-2026: the keyword that should be the strongest filter has become the weakest. You have to source on the second-order signals, and they diverge sharply.
Archetype 1: the evals and fine-tune IC
This is the archetype Hamel Husain has been quietly building a community around for two years. His "AI Evals for Engineers and PMs" course and the upcoming O'Reilly book "Evals for AI Engineers" are the closest thing the field has to a canonical curriculum. Course alumni are a sourcing well. So are contributors to the eval tooling stack.
What their work actually looks like
Anthropic's own engineering guidance is the cleanest description of what this role does day to day. Teams without evals face weeks of regression testing every time a new model drops. Teams with evals upgrade in days. Descript, cited in Anthropic's writeup, built their eval suite around three blunt dimensions: don't break things, do what I asked, and do it well. That is the job. Designing graders, curating datasets, running LLM-as-judge pipelines, and arguing with PMs about what "good" means in a numeric form a regression test can catch.
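If "designing graders" still sounds abstract, here is a minimal sketch of the shape in Python. It is an illustration under stated assumptions, not Descript's or Anthropic's implementation: the `Case` fields, the three grader functions, and the string checks are hypothetical stand-ins. In a real suite the "do it well" grader would more likely be an LLM-as-judge call than a word count.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    """One row of a golden dataset (all fields are hypothetical)."""
    prompt: str
    must_contain: str  # stand-in check for "do what I asked"
    max_words: int     # stand-in check for "do it well"


def dont_break_things(output: str, case: Case) -> bool:
    # Cheapest possible health check: the model returned something.
    return bool(output.strip())


def do_what_i_asked(output: str, case: Case) -> bool:
    return case.must_contain.lower() in output.lower()


def do_it_well(output: str, case: Case) -> bool:
    return len(output.split()) <= case.max_words


GRADERS = {
    "dont_break_things": dont_break_things,
    "do_what_i_asked": do_what_i_asked,
    "do_it_well": do_it_well,
}


def score(model: Callable[[str], str], cases: list[Case]) -> dict[str, float]:
    """Run every case once and return the pass rate per dimension."""
    outputs = [(case, model(case.prompt)) for case in cases]
    return {
        name: sum(grader(out, case) for case, out in outputs) / len(outputs)
        for name, grader in GRADERS.items()
    }


def regressions(baseline: dict[str, float], candidate: dict[str, float],
                tolerance: float = 0.0) -> list[str]:
    """Dimensions where the candidate model scores below the baseline."""
    return [name for name in baseline
            if candidate[name] < baseline[name] - tolerance]
```

The useful part is the last function. Score the current model and the new one on the same cases, and a model upgrade becomes a diff of pass rates, which is what a candidate for this role should be able to walk you through.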
Quality is the production killer for 32% of teams in the LangChain survey. That single statistic is why the evals IC exists as a distinct role and not as a side quest for a backend engineer.
Boolean signal for the eval archetype
The tooling has crystallized. To source evals engineers, build your strings around the eval-and-dataset stack, not the ops stack:
- Braintrust
- LangSmith (used as an eval surface, not just tracing)
- Promptfoo
- DeepEval
- Inspect (the UK AISI framework)
- Opik
Layer in "LLM-as-judge", "rubric", "grader", "golden dataset", "regression suite", or "pairwise eval". On GitHub, look for contributors to DeepEval, Promptfoo, and Inspect, plus authors of public eval-harness repos. People who write blog posts titled "how we built our eval pipeline" are signaling availability whether they mean to or not.
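One way to turn the tool list and the phrase list into a search string is a small helper like the one below. This is a sketch: `any_of` and `eval_boolean` are hypothetical names, and the exact operator syntax (quoting, AND/OR casing, nesting limits) varies by search tool, so treat the output as a starting point to adapt.

```python
EVAL_TOOLS = ["Braintrust", "LangSmith", "Promptfoo", "DeepEval", "Inspect", "Opik"]
EVAL_PHRASES = [
    "LLM-as-judge", "rubric", "grader",
    "golden dataset", "regression suite", "pairwise eval",
]


def any_of(terms: list[str]) -> str:
    """Quote each term and join with OR inside one parenthesized group."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def eval_boolean(tools: list[str] = EVAL_TOOLS,
                 phrases: list[str] = EVAL_PHRASES) -> str:
    """Require at least one eval tool AND at least one eval-practice phrase."""
    return f"{any_of(tools)} AND {any_of(phrases)}"


if __name__ == "__main__":
    print(eval_boolean())
```

Requiring a phrase alongside a tool matters because some tool names are ordinary words. "Inspect" on its own will match plenty of profiles that have nothing to do with the UK AISI framework.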
Red flag: "fine-tuning experience required"
Half the JDs on the June thread demand fine-tuning experience. The LangChain survey says fine-tuning has not been widely adopted. You are asking for evidence most candidates could not have produced at work, which means you are filtering for hobbyists or for people willing to lie. Down-weight it. Ask for eval design instead. The candidate who can describe how they caught a 4-point quality regression with a pairwise grader is worth ten who fine-tuned a Llama variant on a weekend.
Archetype 2: the agent-pipeline reliability engineer
This is the archetype the fintech commenter discovered the hard way after their first prod incident. It is closer to an SRE who learned LLMs than to an ML engineer who learned ops. The work is trace-centric and gateway-centric. The mindset is "what happens at 3am when the tool-call loop wedges on a 429."
What their work actually looks like
The agent reliability stack is OpenTelemetry-native. Laminar, Langfuse, Phoenix, and LangSmith all ingest OTel traces. OpenLLMetry and OpenInference are the vendor-neutral semantic conventions for LLM spans. The reliability engineer thinks in spans, retries, circuit breakers, and gateways. They use Helicone, Portkey, or LiteLLM as the chokepoint they can instrument. They know when a plain deterministic tool beats a model call, which is exactly the phrasing one of the June 2026 JDs used to describe its ideal hire.
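To make "retries and circuit breakers" concrete, here is a minimal sketch of the pattern wrapped around a tool call that can return a 429. Everything named here (`RateLimited`, `CircuitOpen`, `CircuitBreaker`) is hypothetical and not the API of any gateway mentioned above. It is also simplified: a production breaker would add a cool-down timer so the circuit can close again.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error a tool client raises on an HTTP 429."""


class CircuitOpen(Exception):
    """Raised when the breaker refuses to make further calls."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3) -> None:
        self.max_failures = max_failures
        self.failures = 0  # consecutive failures since the last success

    def call(self, fn: Callable[[], T], retries: int = 2,
             base_delay: float = 0.5,
             sleep: Callable[[float], None] = time.sleep) -> T:
        # Fail fast instead of letting the agent loop hammer a dead tool.
        if self.failures >= self.max_failures:
            raise CircuitOpen("too many consecutive failures; skipping tool call")
        for attempt in range(retries + 1):
            try:
                result = fn()
            except RateLimited:
                self.failures += 1
                out_of_budget = attempt == retries
                if out_of_budget or self.failures >= self.max_failures:
                    raise
                sleep(base_delay * 2 ** attempt)  # exponential backoff
            else:
                self.failures = 0  # any success closes the circuit
                return result
        raise ValueError("retries must be >= 0")
```

The interview question writes itself: ask where the candidate would put this logic (in the agent, in the gateway, or both) and what span attributes they would record on each retry.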
Boolean signal for the reliability archetype
To source agent-pipeline engineers, the strings look completely different from the evals strings:
- OpenTelemetry AND ("LLM" OR "agent")
- OpenLLMetry
- OpenInference
- Helicone, Portkey, LiteLLM (gateway layer)
- Langfuse, Laminar, Phoenix (trace ingestion)
- "tool calling" AND ("retry" OR "idempotency" OR "circuit breaker")
- "agent" AND ("on-call" OR "SLO" OR "incident")
Langfuse alone has more than 28,000 GitHub stars under an MIT license. Its contributor and active-issue-commenter list is a credible sourcing pool for reliability-leaning AI engineers who are already comfortable running self-hosted OTel stacks. The Helicone and LiteLLM repos are the same pattern. These are not vanity stars. People who file gateway bugs in production language are people running gateways in production.
The non-obvious sourcing pool
Look at backend platform engineers and SREs who started showing up at AI Engineer Summit or who started contributing to OpenTelemetry's GenAI semantic conventions working group in 2025 and 2026. They did not retitle themselves. Their LinkedIn still says "Staff Platform Engineer" or "Senior SRE." Their last six months of commits say otherwise. This is the pool LinkedIn's title filter actively hides from you.
This is the kind of cross-source query that breaks Boolean. You want SREs whose GitHub activity in the last 12 months crossed into LLM gateways or OTel GenAI work, and whose conference attendance shifted, even though their LinkedIn title has not changed. That is the friction we built Refolk to remove: you describe that person in plain English and get a ranked shortlist across GitHub, LinkedIn, and the open web, without having to pre-guess which keyword they happened to put on their profile.
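To see why this does not fit in a search box, it helps to write the query as a filter over joined records. The sketch below assumes you already have profile and contribution data in one place, which is the hard part. The `Profile` and `Contribution` shapes, the title hints, and the repo labels are all hypothetical, and this is not a description of how Refolk works internally.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Lowercased project names taken from the reliability stack discussed above.
RELIABILITY_REPOS = {
    "openllmetry", "openinference", "langfuse",
    "helicone", "litellm", "portkey",
}
OPS_TITLE_HINTS = ("sre", "site reliability", "platform engineer")


@dataclass(frozen=True)
class Contribution:
    repo: str
    when: date


@dataclass(frozen=True)
class Profile:
    name: str
    title: str
    contributions: tuple[Contribution, ...]


def is_hidden_reliability_candidate(profile: Profile, today: date,
                                    window_days: int = 365) -> bool:
    """Ops title, no 'AI Engineer' retitle, recent LLM-infra commits."""
    title = profile.title.lower()
    has_ops_title = any(hint in title for hint in OPS_TITLE_HINTS)
    retitled = "ai engineer" in title
    cutoff = today - timedelta(days=window_days)
    recent_llm_infra = any(
        c.repo.lower() in RELIABILITY_REPOS and c.when >= cutoff
        for c in profile.contributions
    )
    return has_ops_title and not retitled and recent_llm_infra
```

Each condition lives in a different data source (title on one site, commits on another), which is why a single-site Boolean string cannot express it.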
The GitHub signal asymmetry
If you remember one thing from this piece, remember this: the two archetypes have completely different commit graphs under the same title.
The eval IC's recent activity clusters in scorer repos, dataset curation tools, and prompt-versioning surfaces. Pull requests to DeepEval, Promptfoo, Inspect. Issues filed on Braintrust SDKs. Public Hugging Face datasets. Blog posts about LLM-as-judge calibration.
The reliability engineer's recent activity clusters in trace ingestion, gateways, and runtime. Pull requests to OpenLLMetry, OpenInference, Langfuse, Helicone, LiteLLM. Issues filed against agent runtimes about tool-call retry semantics. Talks at OTel meetups.
A candidate who lists only the first stack is rarely on-call material. A candidate who lists only the second has probably never written an LLM-as-judge. The overlap exists. It is small. Pricing your one open role as if every applicant sits in the overlap is how you end up six months in with no hire.
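The asymmetry is simple enough to sketch as a bucket count over the repos a candidate has recently touched. This is a rough heuristic under stated assumptions: the `archetype` function is hypothetical, the repo sets only cover projects named in this piece, and a real pass would weight by recency and by the size of each contribution.

```python
from collections import Counter

# Lowercased project names from the two stacks described above.
EVAL_REPOS = {"deepeval", "promptfoo", "inspect"}
RELIABILITY_REPOS = {"openllmetry", "openinference", "langfuse", "helicone", "litellm"}


def archetype(recent_repos: list[str]) -> str:
    """Label a candidate from the repos they recently contributed to."""
    counts: Counter[str] = Counter()
    for repo in recent_repos:
        name = repo.lower()
        if name in EVAL_REPOS:
            counts["evals"] += 1
        elif name in RELIABILITY_REPOS:
            counts["reliability"] += 1
    if not counts:
        return "unknown"  # no signal either way; fall back to other sources
    if counts["evals"] and counts["reliability"]:
        return "overlap"  # the small pool that genuinely spans both
    return "evals" if counts["evals"] else "reliability"
```

Run it over a title-matched list and the "overlap" bucket gives you a rough sense of how few people your combined JD is actually addressed to.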
How to fix the AI engineer job description
Three changes, all cheap.
1. Split the JD even if the headcount is one
Post two JDs. Title them "AI Engineer, Evals" and "AI Engineer, Reliability" or equivalent. Let candidates self-select. The rare unicorn who can do both reads a combined JD as a wage-suppression signal (one salary for two jobs) and self-selects out. The two specialists you actually need read a combined JD as confused and skip it. Splitting the post costs you a job board fee. Not splitting it costs you a quarter.
2. Replace "fine-tuning required" with "eval design required"
Ask for a writeup of an eval suite the candidate designed, with the specific failure modes it catches. This filters for the IC archetype cleanly. Candidates without that artifact but with strong reliability signals route to the other JD.
3. Source on second-order signals, not titles
Treat "AI Engineer" as a denominator, not a numerator. The 3,700 US profiles with the literal title are a starting set, not a shortlist. The shortlist comes from the tool stack, the GitHub graph, and the conference graph. Refolk is built for exactly this kind of cross-source query: ask for "Langfuse contributors in NYC with prior SRE experience" or "engineers who took Hamel Husain's evals cohort and have shipped a public grader," and the ranking happens against signals, not titles.
The takeaway
The June 2026 HN thread did not invent the two-archetype problem. It just made it impossible to keep writing JDs as if the split did not exist. The eval IC and the reliability engineer are now two roles with two tool stacks, two GitHub fingerprints, and two on-call profiles. The 89% observability vs 52% evals adoption gap is the market telling you the split is real. The fintech commenter who hired two people instead of one is the market telling you the cost of pretending otherwise.
Write the JD twice. Source on the stack, not the title. And if the budget is one hire, hire the reliability engineer first. The first prod incident usually comes before the first quality regression worth catching.
FAQ
Should I hire the evals IC or the reliability engineer first?
If you have agents in production or about to go to production, hire the reliability engineer first. The LangChain survey shows quality is cited as the top barrier by 32% of teams, but observability is what lets you diagnose anything at all. You cannot run evals against incidents you cannot trace. Once trace ingestion and a gateway are in place, the evals IC has a substrate to work against.
Can a strong backend engineer grow into either archetype?
Into the reliability archetype, yes, often in three to six months if they already know OpenTelemetry and have run a production system with retries and circuit breakers. Into the evals archetype, slower. Eval design is a research-adjacent skill (rubric design, inter-rater reliability, dataset curation) that backend engineers rarely have reps in. Hire for it directly or sponsor someone through Hamel Husain's cohort.
What Boolean string filters out the title-chasers?
For the eval archetype, require at least two of: Braintrust, LangSmith, Promptfoo, DeepEval, Inspect, plus the phrase "LLM-as-judge" or "grader." For the reliability archetype, require OpenTelemetry plus one of OpenLLMetry, OpenInference, Helicone, Portkey, LiteLLM, Langfuse, or Laminar. Title-chasers list ChatGPT and "prompt engineering." Practitioners list the tools they actually instrument.
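The "at least two of" rule is awkward to express in most Boolean search boxes, but it is easy to apply as a post-filter on profile or resume text you have already pulled. A minimal sketch, with hypothetical function names and plain substring matching (so "inspect" will also match "inspected"; tighten with word boundaries if that noise matters):

```python
EVAL_TOOLS = ("braintrust", "langsmith", "promptfoo", "deepeval", "inspect")
EVAL_PHRASES = ("llm-as-judge", "grader")
RELIABILITY_TOOLS = (
    "openllmetry", "openinference", "helicone",
    "portkey", "litellm", "langfuse", "laminar",
)


def count_hits(text: str, terms: tuple[str, ...]) -> int:
    """Number of distinct terms that appear in already-lowercased text."""
    return sum(term in text for term in terms)


def matches_eval(text: str) -> bool:
    """At least two eval tools plus one eval-practice phrase."""
    lowered = text.lower()
    return count_hits(lowered, EVAL_TOOLS) >= 2 and count_hits(lowered, EVAL_PHRASES) >= 1


def matches_reliability(text: str) -> bool:
    """OpenTelemetry plus at least one gateway or tracing tool."""
    lowered = text.lower()
    return "opentelemetry" in lowered and count_hits(lowered, RELIABILITY_TOOLS) >= 1
```

A profile that lists only ChatGPT and "prompt engineering" fails both checks, which is the point.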
How do I find reliability engineers who still call themselves SREs?
This is where keyword search breaks. Their LinkedIn title has not caught up to their GitHub. The cleanest approach is a cross-source query: SRE or platform engineer title, plus recent contributions to LLM gateway or OTel GenAI repos, plus AI Engineer Summit attendance or talks. That is the exact query shape Refolk was built to answer in plain English instead of seventeen tabs of advanced search.