ML Engineer on the research team at Mercor. Three problems consumed most of my time: helping to build RLHF pipelines, replace human-heavy QA with LLM judges, and squeeze latency out of LLM inference.
RLHF pipelines
The problem with RLHF at scale is deceptively simple: you're trying to use human judgment to improve model outputs. But human judgment is noisy, inconsistent, and expensive. Your reward model drifts. Your evaluators disagree. The distribution of tasks you're optimizing for keeps shifting under you.
Most teams focus on the modeling side — PPO vs DPO, reward model architecture, training stability. The real leverage is in the data. Which human judgments do you trust? How do you weight conflicting signals? How do you detect when your reward model is gaming the metric instead of genuinely improving?
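A minimal sketch of one way to handle conflicting signals: score each annotator by how often they agree with the per-item majority, then use those weights when resolving disagreements into a single soft label. The function names and data layout here are hypothetical illustrations, not the actual pipeline; a production system would use something closer to Dawid-Skene or a learned reliability model.

```python
from collections import defaultdict

def annotator_weights(labels):
    """Weight each annotator by how often they agree with the per-item majority.

    `labels` maps item_id -> list of (annotator_id, preference), where
    preference is "a" or "b". Purely illustrative.
    """
    agree, total = defaultdict(int), defaultdict(int)
    for _, votes in labels.items():
        counts = defaultdict(int)
        for _, pref in votes:
            counts[pref] += 1
        majority = max(counts, key=counts.get)
        for annotator, pref in votes:
            total[annotator] += 1
            agree[annotator] += int(pref == majority)
    return {a: agree[a] / total[a] for a in total}

def weighted_preference(votes, weights):
    """Resolve one item's conflicting votes into a soft label in [0, 1],
    where 1.0 means every weighted vote favored response "a"."""
    score = sum(weights.get(a, 0.5) for a, pref in votes if pref == "a")
    norm = sum(weights.get(a, 0.5) for a, _ in votes)
    return score / norm if norm else 0.5
```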
I helped build infrastructure to curate, weight, and validate human feedback at scale — the systems that decide which feedback to learn from and which to ignore. The model is often the easy part. The hard part is the data pipeline that feeds it.
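To give a flavor of the validation side, here is a sketch of two cheap checks one might run against a held-out human-labeled set: pairwise accuracy of the reward model, and a length-reward correlation as an early-warning sign of verbosity gaming. `reward_fn`, the data shapes, and the thresholds implied are assumptions for illustration only.

```python
import statistics

def holdout_accuracy(reward_fn, pairs):
    """Fraction of held-out human preference pairs the reward model ranks correctly.

    `pairs` is a list of (prompt, chosen, rejected) from trusted human labels;
    `reward_fn(prompt, response) -> float` is whatever reward model is under test.
    """
    correct = sum(reward_fn(p, chosen) > reward_fn(p, rejected)
                  for p, chosen, rejected in pairs)
    return correct / len(pairs)

def length_reward_correlation(reward_fn, samples):
    """Pearson correlation between response length and reward (Python 3.10+).

    A high value is a cheap warning sign that the reward model is rewarding
    verbosity rather than genuine quality.
    """
    lengths = [len(resp.split()) for _, resp in samples]
    rewards = [reward_fn(p, resp) for p, resp in samples]
    return statistics.correlation(lengths, rewards)
```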
LLM as a judge
Once the RLHF pipeline was working, the bottleneck moved to QA. Every model iteration needed thousands of human evaluations to know whether it was actually better. That doesn't scale and it doesn't run overnight.
I helped implement an LLM-as-a-judge layer to replace most of that work. The hard part isn't writing a rubric and asking a model to score — it's making the judge agree with humans often enough that you can trust it. That meant calibrating against held-out human-labeled sets, controlling for position and verbosity bias, and ensembling across prompts when single-judge variance was too high.
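A rough sketch of the position-bias control and variance reduction described above: score each pair in both orderings, sample the judge a few times, and report agreement against held-out human labels. `judge_fn` and the voting scheme are illustrative stand-ins, not the production judge.

```python
def judged_preference(judge_fn, prompt, answer_a, answer_b, n_samples=3):
    """Soft preference for answer_a in [0, 1], with both orderings scored
    to cancel position bias and multiple samples to reduce variance.

    `judge_fn(prompt, first, second) -> "first" | "second"` stands in for
    whatever LLM call actually implements the rubric.
    """
    votes_for_a, total = 0, 0
    for _ in range(n_samples):
        # Original order: answer_a shown first.
        if judge_fn(prompt, answer_a, answer_b) == "first":
            votes_for_a += 1
        # Swapped order: answer_a shown second, so "second" is a vote for it.
        if judge_fn(prompt, answer_b, answer_a) == "second":
            votes_for_a += 1
        total += 2
    return votes_for_a / total

def agreement_with_humans(judge_scores, human_labels, threshold=0.5):
    """Calibration check: how often the judge's soft preference matches
    held-out human pairwise labels ("a" or "b")."""
    hits = sum((score > threshold) == (label == "a")
               for score, label in zip(judge_scores, human_labels))
    return hits / len(human_labels)
```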
The payoff was a feedback loop that ran in minutes instead of days. Engineers could ship a candidate model, get a quality read by lunch, and iterate the same afternoon.
Inference optimizations
The third problem was cost and latency. The judge layer, the eval pipelines, and production traffic were all competing for the same inference budget. Naive serving meant either burning money or making evaluators wait.
Most of the wins came from boring fundamentals applied carefully: aggressive prompt caching for repeated system prompts and rubrics, batching judge calls so a single forward pass scored many candidates, speculative decoding for the latency-sensitive path, and routing easy queries to smaller models while reserving the big model for the cases that actually needed it. KV-cache reuse alone paid for itself within a week.
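To give a feel for the routing piece specifically, here is a toy sketch of the idea: send short, easy-looking prompts to a small model and everything else to the big one. The heuristic, names, and thresholds are made up for illustration; a real router would rely on a learned difficulty signal or logprob-based uncertainty rather than keyword matching.

```python
def route(prompt, small_model, large_model,
          max_easy_words=200, hard_keywords=("code", "prove", "derive")):
    """Toy difficulty router: short prompts with no hard-looking keywords go
    to the small model; everything else goes to the large one."""
    easy = (len(prompt.split()) < max_easy_words
            and not any(k in prompt.lower() for k in hard_keywords))
    return small_model if easy else large_model
```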
None of this is novel in isolation. The work is in the plumbing — knowing which optimization is safe for which call site, and instrumenting enough to prove it didn't quietly degrade quality.
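As one example of what that instrumentation can look like, a sketch of a paired quality gate: re-score the same prompts with and without an optimization and only ship it if the mean judge score does not drop past a small tolerance. The names and the tolerance value are illustrative, not the actual gate.

```python
import statistics

def optimization_is_safe(baseline_scores, optimized_scores, max_drop=0.01):
    """Paired check that an inference optimization did not degrade quality.

    `baseline_scores[i]` and `optimized_scores[i]` are judge scores for the
    same prompt without and with the optimization. Passes only if the mean
    score drop stays within `max_drop`.
    """
    deltas = [o - b for b, o in zip(baseline_scores, optimized_scores)]
    return statistics.mean(deltas) >= -max_drop
```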