Refolk
May 19, 2026 · 9 min read

Decart's $300M Round Hires for One Skill LinkedIn Can't Find

Decart raised $300M on May 18 to scale DOS across NVIDIA, Trainium, and TPU. The engineers who can actually build it don't show up in title search.

CUDA kernel engineer sourcing, GPU optimization engineer hiring, Decart AI hiring, chip abstraction engineers, inference optimization talent
On May 18, 2026, Decart announced a $300M Series C led by Radical Ventures, with NVIDIA, Amazon, Adobe Ventures, Toyota Ventures, and Andrej Karpathy on the cap table, valuing the two-year-old Israeli company at roughly $4B. The capital funds aggressive engineering hires for DOS, Decart's hardware-agnostic optimization stack that moves models between NVIDIA GPUs, AWS Trainium, and Google TPUs without months of per-chip CUDA work. If you're a recruiter or engineering leader reading this, the question isn't whether Decart can hire. It's whether you can hire the same people before they do.

The honest count of engineers who can ship a production kernel that beats cuBLAS on NVIDIA and also runs cleanly on Trainium is in the low hundreds globally. Every hyperscaler, every frontier lab, and now Decart wants the same names.

What Decart actually bought with $300M

Decart's pitch isn't another foundation model. It's an optimization layer. CEO Dean Leitersdorf and CPO Moshe Shalev have been building DOS since 2023, and the proof points in the funding announcement are infrastructure numbers, not benchmark scores:

  • DOS 2.0 pushes AI agents past 1,600 tokens per second, which Decart claims is eight times the industry average.
  • Lucy2, running on AWS Trainium, exceeds 80% Model FLOPS Utilization. Annapurna Labs VP Nafea Bshara confirmed the number publicly. Most production LLM workloads sit at 30 to 50% MFU.
  • Decart says DOS compresses per-chip optimization from months to weeks, and brings inference cost from "tens to thousands of dollars per hour" down to under $0.25 per hour on their models.

80% Model FLOPS Utilization on Trainium: Decart's Lucy2 figure, confirmed by Annapurna Labs. Most production LLM workloads run at 30 to 50%.

None of that ships without engineers who can write attention and GeMM kernels from scratch, port them across vendors, and squeeze the silicon. NVIDIA committing capital here is itself the tell. NVIDIA has put more than $40B of AI equity to work in 2026. Funding an explicitly chip-agnostic abstraction layer makes sense only if you'd rather own the cap table of the layer than fight it. Candidates will read that subtext during their offer cycle, and your recruiters should be ready for the question.

Why title search returns nothing useful

Run "CUDA Kernel Engineer" on LinkedIn. You'll get a handful of profiles. Add "GPU Kernel Engineer" and "Performance Engineer," filter for CUDA plus Triton across the US, UK, Canada, Germany, and Israel, and our index surfaces fewer than fifteen search-resolvable profiles in total. They cluster at Modular, Meta, Apple, Wayve, and Mako, mostly in the Bay Area, the UK, and Austin.

That number is wrong, of course. The real population is materially larger. It's just that almost none of these engineers carry the literal title. They show up as "Member of Technical Staff," "Research Engineer," "Software Engineer," "Compiler Engineer," or "ML Systems Engineer." The work product is what identifies them: a merged PR to CUTLASS, a kernel on the KernelBot leaderboard, a talk at GPU MODE, a commit graph on vLLM or TensorRT-LLM, a fused attention kernel in FlashAttention.

This is the actual sourcing problem. Keyword search on LinkedIn is the wrong tool because the keyword isn't in the profile. The artifact is on GitHub, the talks are on YouTube, the leaderboards are on Discord, and the proof is in a commit message from eighteen months ago. Stitching those together by hand takes a senior sourcer a full day per role. It's also exactly the workflow we built Refolk around: describe the engineer in plain English ("shipped a Triton kernel that landed in vLLM or FlashAttention, currently at a US or Israeli AI lab, not a manager"), and get a ranked shortlist across GitHub, LinkedIn, and the open web in one pass.

The four watering holes that actually matter

If you only sourced from these four places for the next ninety days, you'd cover most of the realistic pool.

1. GPU MODE Discord and KernelBot

The GPU MODE Discord, run by co-founder Mark Saroufim (Meta/PyTorch), has 27,395 members. That's the top of the funnel, not the pool. The pool is the active top of the KernelBot leaderboard plus the people who give talks in the lectures repo. The gpu-mode/reference-kernels and gpu-mode/lectures GitHub repos are public, and the contributor graphs are sourceable. Cross-reference KernelBot rankings with GitHub identities, then resolve to LinkedIn or email.

2. CUTLASS, FlashAttention, vLLM, TensorRT-LLM, Triton

This is the artifact layer. Look at non-trivial merged PRs in the last 18 months. Filter out doc fixes and CI changes. The remaining contributor set, maybe a few hundred people across all five repos with overlap, is the defensible count for "CUDA kernel engineer" as a real skill. Tri Dao's FlashAttention work and the Triton ecosystem around it (FlagGems ships 230+ Triton operators focused on LLM training and inference) are where the densest signal lives.
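The filter described above can be sketched in a few lines. This is a minimal illustration, not a sourcing tool: the PR record shape (`author`, `merged_at`, `files`), the path heuristics for "doc fix or CI change," and the sample data are all assumptions made for the example, and fetching the records from GitHub is left out.

```python
from datetime import datetime, timedelta

# Paths treated as docs or CI. A hypothetical heuristic; tune it per repo.
TRIVIAL_PREFIXES = ("docs/", ".github/", ".ci/")
TRIVIAL_SUFFIXES = (".md", ".rst", ".yml", ".yaml")


def is_trivial(path):
    """True if a changed file looks like documentation or CI config."""
    return path.startswith(TRIVIAL_PREFIXES) or path.endswith(TRIVIAL_SUFFIXES)


def substantive_contributors(prs, as_of, window_days=548):
    """Return authors with at least one merged, non-trivial PR in the window.

    Each PR is a dict with assumed keys: 'author', 'merged_at' (a datetime,
    or None if unmerged), and 'files' (list of changed paths).
    """
    cutoff = as_of - timedelta(days=window_days)  # roughly 18 months
    authors = set()
    for pr in prs:
        merged_at = pr["merged_at"]
        if merged_at is None or merged_at < cutoff:
            continue
        # Keep the PR only if it touches at least one non-doc, non-CI file.
        if any(not is_trivial(f) for f in pr["files"]):
            authors.add(pr["author"])
    return authors


# Made-up records for illustration only.
sample_prs = [
    {"author": "alice", "merged_at": datetime(2026, 1, 10),
     "files": ["csrc/attention.cu", "README.md"]},
    {"author": "bob", "merged_at": datetime(2026, 2, 1),
     "files": ["docs/install.md"]},
    {"author": "carol", "merged_at": datetime(2023, 6, 1),
     "files": ["kernels/gemm.cu"]},
    {"author": "dave", "merged_at": None,
     "files": ["kernels/softmax.cu"]},
]

print(sorted(substantive_contributors(sample_prs, as_of=datetime(2026, 5, 19))))
# prints ['alice']
```

A path-based filter is crude. It will miss a one-line kernel fix hidden in a mostly-docs PR and will pass a large but mechanical refactor, so treat the output as a first cut that a human still reviews.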

3. Annapurna Labs, Google TPU compiler teams, and Modular

This is the bucket title search will never give you, and it's the one Decart cares about most. DOS is differentiated because it runs on Trainium and TPUs, not only NVIDIA. The rarest hires aren't pure CUDA jocks. They're engineers who've shipped on Neuron SDK, XLA, Pallas, or Mojo. Annapurna alumni in particular are the scarce asset. Bshara's team built Trainium and Graviton, and that team is itself a primary recruiting ground for anyone serious about chip-agnostic inference optimization. Add ex-Google TPU compiler engineers and the Modular team behind Mojo/MAX, and you have the second half of Decart's actual target list.

4. Adjacent JDs that name the skill explicitly

xAI is currently posting a Member of Technical Staff role for "low-level CUDA kernel optimizations" and "high-performance GeMM CUDA kernels using Tensor cores or CUDA cores from scratch or by utilizing CuTe/CUTLASS," Bay Area and Seattle only. Magic posts a Kernel Engineer role. Waymo posts CUDA C++ for perception. These JDs are useful for two reasons: they tell you who you're competing against on comp, and anyone who applies to them has already raised a hand. Anyone interviewing at xAI for this profile in May 2026 is, by definition, in the pool.

The 80% MFU screening question

Most CUDA interview loops devolve into trivia: warp sizes, bank conflicts, shared memory tiling. Useful for a junior screen, useless for identifying the people Decart actually wants. The better filter is the Decart number itself.

An engineer who has pushed a real model past 70% MFU on any accelerator is, by definition, in the pool.

Ask candidates the highest MFU they've personally achieved on a production model, on what hardware, and what the bottleneck was at that ceiling. The answers separate the field instantly. Most strong ML engineers will say 30 to 45%. Strong systems engineers who've actually optimized kernels will say 55 to 65% and explain the memory bandwidth wall. The target hires will say 70%+ and have a war story about overlapping collectives with compute, or a custom attention variant that fused three operators, or a quantization scheme that survived a real eval. You can run that screen in fifteen minutes. It's worth more than a four-hour take-home.
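For interviewers who want to sanity-check an answer, MFU is just achieved model FLOPs per second divided by the hardware's peak. A rough sketch follows, using the common approximation of about 6 FLOPs per parameter per token for training (about 2 for forward-only inference) and ignoring attention FLOPs. The model size, throughput, and peak figure are made-up inputs, and the bucket cutoffs simply collapse the bands in the paragraph above into hard thresholds, which is an assumption of this example.

```python
def mfu(tokens_per_sec, n_params, peak_flops_per_sec, flops_per_param=6):
    """Model FLOPs Utilization as a fraction of hardware peak.

    flops_per_param=6 approximates training (forward plus backward);
    use 2 for forward-only inference. Attention FLOPs are ignored.
    """
    achieved = tokens_per_sec * flops_per_param * n_params
    return achieved / peak_flops_per_sec


def screen_bucket(mfu_fraction):
    """Map a claimed MFU onto the rough bands from the screening question.

    The exact cutoffs are an assumption; the gaps between the quoted
    ranges are folded into the lower band.
    """
    if mfu_fraction >= 0.70:
        return "target hire"
    if mfu_fraction >= 0.55:
        return "strong systems engineer"
    if mfu_fraction >= 0.30:
        return "strong ML engineer"
    return "below typical production range"


# Hypothetical claim: a 7B-parameter model training at 10,000 tokens/sec
# on hardware with an aggregate peak of 1e15 FLOP/s.
claimed = mfu(tokens_per_sec=10_000, n_params=7_000_000_000,
              peak_flops_per_sec=1e15)
print(f"{claimed:.0%}", screen_bucket(claimed))
# prints 42% strong ML engineer
```

The arithmetic matters less than whether the candidate can supply the three inputs from memory and say which one was the binding constraint.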

The Israeli pipeline is real and it's about to get expensive

Decart is headquartered in Israel. Leitersdorf and Shalev have built from a Tel Aviv base since 2023. The Unit 8200 and Technion pipeline for systems performance work is a known quantity, and Decart has a structural advantage recruiting from it. If you're a US-based engineering leader trying to compete, two things follow.

First, you need to source the Israeli pool directly, not wait for emigrants to land in the Bay Area. The talent that wrote production CUDA at a defense unit or a Mobileye-adjacent startup in 2022 is now four years senior and exactly the profile. Second, you need a credible answer on remote or hybrid. Decart can offer in-office in Tel Aviv. If you can't match that geography, you compete on scope and comp, and you'd better be specific about both.

27,395 GPU MODE Discord members: the top of the funnel. The realistic pool that has shipped a production kernel beating cuBLAS or a working Trainium kernel is a low three-digit number.

A sourcing plan for the next 90 days

If you're hiring against Decart, or trying to keep your own kernel engineers from getting poached, here's the compressed playbook:

  1. Build a named list of 200 to 300 people from artifact data, not titles. CUTLASS, FlashAttention, vLLM, TensorRT-LLM, Triton contributor graphs, plus KernelBot top rankings, plus GPU MODE speaker lists. This is the universe.

  2. Layer in the chip-agnostic bucket. Annapurna Labs alumni on LinkedIn, ex-Google TPU compiler engineers, current and former Modular employees who shipped Mojo kernels. These names will not appear in any CUDA keyword search, which is the entire point. This is where Refolk's plain-English search earns its keep: you can ask for "engineers who shipped on Neuron SDK or XLA, currently not at AWS or Google" and skip the Boolean gymnastics.

  3. Cross-reference against xAI, Magic, Waymo, and Modular applicant signal where you have it. Anyone interviewing for those roles right now has self-identified.

  4. Pre-write the comp and scope conversation. The candidates worth chasing will get three offers in two weeks. They'll ask, in order: what's the scope (am I writing kernels or babysitting Triton autotuning), what's the hardware (is this really multi-vendor or NVIDIA-with-a-roadmap-slide), and what's the comp floor. Decart's pitch is "ship across NVIDIA, Trainium, and TPU at 80% MFU." If your pitch is weaker than that on any of the three axes, your close rate will reflect it.

  5. Move in days, not weeks. Decart announced on May 18. Offers for the top 30 names on this list are already out. The honest sourcing window before the market resets on comp is roughly two weeks.
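Steps 1 and 2 of the playbook come down to merging several handle lists and ranking people by how many independent sources they show up in. A minimal sketch, assuming you've already exported each source to a list of GitHub handles; the source names and handles below are invented for illustration.

```python
from collections import Counter


def rank_candidates(sources):
    """Rank handles by how many independent sources they appear in.

    `sources` maps a source name to an iterable of handles. Handles are
    lowercased so the same login is not counted twice.
    """
    counts = Counter()
    for handles in sources.values():
        # Use a set per source so one person counts once per source.
        for handle in {h.lower() for h in handles}:
            counts[handle] += 1
    # Most sources first; ties broken alphabetically for a stable list.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Invented example data.
sources = {
    "cutlass_prs": ["alice", "bob"],
    "vllm_prs": ["Alice", "carol"],
    "kernelbot_top": ["alice", "carol", "dave"],
    "gpu_mode_speakers": ["bob"],
}

for handle, n_sources in rank_candidates(sources):
    print(handle, n_sources)
```

Counting sources rather than raw PRs is a deliberate choice here: someone who appears in a kernel repo, a leaderboard, and a speaker list is a stronger signal than someone with many commits in one place.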

The companies that win this round won't be the ones with the biggest recruiting budget. They'll be the ones who figured out, before the announcement, which 200 GitHub handles actually matter and which titles to ignore. Once that list exists, the rest is execution. Refolk's job is to compress getting to the list from a week of sourcer time to a single query.

FAQ

How many CUDA kernel engineers are there really?

If you mean people whose job title literally contains "CUDA Kernel" or "GPU Kernel Engineer," fewer than 15 globally resolve in our index when combined with CUDA plus Triton skills. If you mean people who've shipped a non-trivial PR to CUTLASS, FlashAttention, vLLM, TensorRT-LLM, or Triton in the last 18 months, the number is a low three-digit count with meaningful overlap. The realistic competitive pool for a Decart-tier hire is roughly 200 names worldwide.

Why is NVIDIA investing in a chip-agnostic startup?

NVIDIA has deployed more than $40B of AI equity in 2026, often pairing investment with long-term GPU commitments where some revenue flows back. Funding Decart, which explicitly runs on Trainium and TPUs alongside NVIDIA, looks contradictory until you read it as a hedge. NVIDIA would rather own the cap table of the optimization abstraction layer than be disintermediated by it. Expect candidates to ask about this directly during offer conversations.

What's the single best screening question for this role?

Ask the candidate the highest Model FLOPS Utilization they've personally achieved on a production model, on what accelerator, and what the bottleneck was at that ceiling. Decart's Lucy2 hits over 80% MFU on Trainium. Most production workloads sit at 30 to 50%. Anyone who can credibly describe pushing past 70% has done the work. Anyone who can't, hasn't, regardless of what their resume says about CUDA.

Where should I source if I can't compete with Bay Area comp?

Israel and the UK. Decart's own engineering base is in Tel Aviv, drawing from Unit 8200 and the Technion. The UK has a meaningful cluster around Wayve, DeepMind alumni, and the Cambridge systems community. Both regions have strong kernel and compiler talent at materially lower comp ceilings than the Bay Area, and both are underexplored by US recruiters who default to LinkedIn title search in San Francisco.
