Half my model regressions were the measurement lying to me.

I kept "fixing" the model. The prompt deltas wouldn't move the score, a tweak that looked like progress turned out to be noise, and a run hung for eighteen minutes while every timeout reported green. Over and over, what I was actually measuring was the harness, not the model.

This is a note about bench hygiene, written from the inside of a tool I actually run. deepdive is my research agent — it plans sub-queries, fetches sources off the open web, synthesizes a cited answer, and scores itself against golden questions on completion, source count, citation support, length, and cost. The highest-leverage knob in a thing like that is the prompts, and for a stretch I couldn't tune them: every time I changed a prompt and the number moved, I couldn't tell whether I'd improved the model or just rolled the dice on a flaky bench.

Four times now I've chased what looked like a model regression and found a lie in the measurement instead. Here are the four, with the real numbers, because the pattern is the point: fix the harness before you trust the number.

1. The bench couldn't see my prompt changes, because the search backend was rate-limiting me

The first symptom was the most demoralizing kind: nothing I did to the synthesis prompt changed the score in a way I could trust. Run the bench, get one answer; run it again, get a different one. You cannot tune a prompt on a substrate that won't hold still.

I'd assumed the flux was the model — non-streaming serving on a busy day, the usual hand-wave. It wasn't. The fetch layer was the liar. The default search backend, DuckDuckGo's HTML endpoint, was rate-limiting my test box's IP and silently degrading: two runs in the log literally read "primary search produced nothing — retrying via wikipedia," and the answer got built entirely from Wikipedia. For a factual-lookup question about Claude's rate-limit behavior, Wikipedia simply doesn't contain the answer — so the model, correctly, refused to cite sources that didn't support the claim. That's grounded behavior working as designed, and I'd been reading it as a model defect for days.

The fix was egress, not the model. The same box runs a SearXNG instance whose engines exit through a ProtonVPN tunnel — a different IP than the one DuckDuckGo was throttling. Swapping the bench to SearXNG made the measurement come back:

DuckDuckGo   1/3 cite   (2 runs fell back to Wikipedia, 100% Wikipedia sources)
SearXNG      3/3 cite   (support 1.00 / 0.81 / 1.00, zero fallback)

Three out of three, clean, repeatable. So I made SearXNG a bench-only override — the product default stays DuckDuckGo, and a flag reproduces the old behavior — and committed a fresh 5/6 baseline at support 0.92–1.00, retiring the DuckDuckGo-confounded scoreboard it replaced (#102, #105). The reusable trick: when a backend rate-limits your test IP, route the bench through a different egress and you dodge the throttle that was corrupting your numbers. The lesson underneath it is older: a flaky measurement isn't permission to keep tuning. It's a bug in the bench, and it's first in line.
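A minimal sketch of what a bench-only override can look like, assuming hypothetical names throughout: the `--bench` and `--search-backend` flags, the two default constants, and `resolve_backend` are illustrative, not deepdive's actual interface. The only point is the precedence: an explicit flag wins, otherwise the bench and the product each get their own default.

```python
import argparse

# Hypothetical names: these constants and flags are illustrative only.
PRODUCT_DEFAULT = "duckduckgo"
BENCH_DEFAULT = "searxng"
KNOWN_BACKENDS = ("duckduckgo", "searxng")


def resolve_backend(explicit, is_bench):
    """Pick the search backend: an explicit choice wins, else a per-mode default."""
    if explicit is not None:
        if explicit not in KNOWN_BACKENDS:
            raise ValueError(f"unknown search backend: {explicit!r}")
        return explicit
    return BENCH_DEFAULT if is_bench else PRODUCT_DEFAULT


def backend_from_argv(argv):
    """Parse the (hypothetical) flags and resolve the backend for this run."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--bench", action="store_true")
    parser.add_argument("--search-backend", choices=KNOWN_BACKENDS, default=None)
    args = parser.parse_args(argv)
    return resolve_backend(args.search_backend, args.bench)
```

With this shape, `backend_from_argv(["--bench", "--search-backend", "duckduckgo"])` reproduces the old confounded setup on demand, while a plain product run never sees the bench default.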

2. The "improvement" that was DuckDuckGo noise inverting on a stable backend

This is the one that justifies the whole discipline. Before I'd traced the backend problem, I'd added grounding rules to the synthesis prompt — use verbatim source terms, one claim per cited sentence — and seen a weak positive signal. I was ready to keep them.

Once the bench was stable on SearXNG, I re-ran the experiment properly, and the rules failed their own metric (#97). They made the model cite more, but the extra citations didn't survive the lexical verifier: on the academic question, +9 cites, zero verified, and support collapsed 0.93 → 0.58. Support regressed on all five measurable questions. The "weak positive signal" I'd almost shipped was DuckDuckGo-flake noise that inverted the moment the substrate held still. n=1 caveat applies, but I reproduced the rejection on the stable backend, which is exactly the point: the first read was on a broken bench, so the first read was worthless.

The verifier I'm scoring against here is itself deliberate. Every [N] citation gets checked against the source it points to by a pure lexical recall function — tokenize, drop stop-words, score overlap — explicitly not a second model, because an LLM judge would reintroduce the hallucination class the verifier exists to catch. That mattered here: a deterministic metric is the only kind that can tell you your prompt change made things worse. A model-graded one would have happily agreed with whatever I'd just told it to do.
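A minimal sketch of a lexical recall check in that spirit: tokenize, drop stop-words, score overlap. The stop-word list, the function names, and the 0.6 threshold are assumptions for illustration, not the verifier's real values.

```python
import re

# Assumption: a tiny illustrative stop-word list; a real one would be longer.
STOP_WORDS = frozenset({
    "a", "an", "the", "of", "to", "in", "on", "and", "or", "is", "are",
    "was", "were", "for", "with", "that", "this", "it", "as", "by", "at", "be",
})


def content_tokens(text):
    """Lowercase, split on non-alphanumerics, and drop stop-words."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS}


def lexical_recall(claim, source):
    """Fraction of the claim's content tokens that also appear in the source."""
    claim_tokens = content_tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & content_tokens(source)) / len(claim_tokens)


def is_supported(claim, source, threshold=0.6):
    # Assumption: 0.6 is an arbitrary illustrative cutoff.
    return lexical_recall(claim, source) >= threshold
```

The same inputs always give the same score, which is what lets a metric like this report that a prompt change made things worse.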

3. The date line that helped one prompt and hurt another

The planner, the part that writes sub-queries, had no idea what year it was. So recency-sensitive questions got anchored to the model's training-time notion of "recent," and with a date filter on, the post-fetch cull then threw out what those stale queries returned. The fix is obvious: tell it the date. Add a line reading "Today's date: YYYY-MM-DD" to the planner prompt, and instruct event-shaped sub-queries to use absolute dates.

That helped event queries. The first version also broke something I wasn't watching: it blanket-anchored every sub-query, and the bench caught scholarly citation support dropping 0.68 → 0.44, reproduced twice (#95). The mechanism is that search engines match text, not intent. Stamping a bare year token onto a query aimed at arXiv or OpenAlex distorts the keyword match — a 2026 paper isn't more relevant to a timeless concept than a 2019 one, but the token says otherwise. A counterweight rule — conceptual and scholarly sub-queries stay timeless — was load-bearing, and it restored scholarly support to 0.92.
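A minimal sketch of the two rules living side by side in the planner prompt. Only the "Today's date" line comes from the text above; the function name and the wording of the other two instructions are assumptions for illustration.

```python
from datetime import date


def planner_date_rules(today):
    """Build the date block for the planner prompt (wording is illustrative)."""
    return "\n".join([
        f"Today's date: {today.isoformat()}",
        # Event-shaped sub-queries get anchored to real dates.
        "For sub-queries about events, releases, or news, use absolute dates.",
        # The counterweight: keep year tokens out of scholarly keyword matches.
        "Keep conceptual and scholarly sub-queries timeless: do not add a year or date.",
    ])
```

Keeping the block in a planner-only helper also makes the placement explicit: the synthesis prompt never receives it by accident.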

Here's the part I want to underline, because it's the kind of thing only a bench surfaces. The same date line that helped in the planner had earlier hurt in the synthesis prompt, where it correlated with the hedge-mode failure and got rejected outright. Same words, opposite sign, depending on which prompt they lived in. There is no universal "add the date" best practice. Placement is the variable, and the only way you learn that is by measuring each placement separately instead of trusting that a good idea is good everywhere.

4. The eighteen-minute hang where every timeout passed

A run wedged for about eighteen minutes. The deepdive process, four headless-Chrome children, and the proxy were all sitting idle — not spinning, not erroring, just dead. And every visible timeout was green. The per-fetch deadline (30s) hadn't fired. The per-search deadline (15s) hadn't fired. By every instrument on the dashboard, the system was healthy. It was not healthy. It was a corpse with a pulse oximeter that read fine.

The hole was a protocol call with no timeout at all. The fetch path drove a real browser, and called page.evaluate to run script in the page — which, unlike most of the Playwright surface, accepts no timeout argument. If a page's renderer main thread blocks after domcontentloaded — a modal dialog, a stray window.print(), a document that never settles — that evaluate promise stays pending forever. It never frees its slot in the fixed-size fetch pool, so the pool bleeds a worker per wedge until it deadlocks, and not one of my carefully-set timeouts is wired to the call that's actually stuck (#87).

The fix (#99) was to stop trusting per-operation deadlines and race the entire page lifecycle against a hard wall — 2× the fetch timeout plus 10 seconds — then on breach, raise a wedge error, skip the page, and force-close it without awaiting the close (a wedged page may never settle its own close() either). The takeaway is uncomfortable: a timeout you can see is not the same as a timeout that's covering the thing that hangs. An instrument that always reads green isn't reassurance. It's an instrument that isn't measuring the failure.
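A minimal sketch of that race in Python's asyncio, under stated assumptions: `page` is any object with an async `close()`, `lifecycle` is a coroutine function covering everything done to the page, and the names `PageWedged`, `hard_wall_seconds`, and `fetch_with_hard_wall` are hypothetical. It uses `asyncio.wait` rather than `asyncio.wait_for` so that the caller never awaits the stuck work, even during cancellation.

```python
import asyncio

_background = set()  # keeps fire-and-forget close() tasks from being garbage-collected


class PageWedged(Exception):
    """The page's whole lifecycle outlived the hard wall."""


def hard_wall_seconds(fetch_timeout):
    # The wall described above: 2x the fetch timeout plus 10 seconds.
    return 2 * fetch_timeout + 10


def _reap(task):
    _background.discard(task)
    if not task.cancelled():
        task.exception()  # mark any close() error as retrieved


async def fetch_with_hard_wall(page, lifecycle, fetch_timeout=30.0, wall=None):
    """Run lifecycle(page) under one outer deadline that covers every step."""
    if wall is None:
        wall = hard_wall_seconds(fetch_timeout)
    task = asyncio.ensure_future(lifecycle(page))
    done, _ = await asyncio.wait({task}, timeout=wall)
    if task in done:
        return task.result()
    # Breach: abandon the work and close the page without awaiting the close,
    # since a wedged page may never settle its own close() either.
    task.cancel()
    closer = asyncio.ensure_future(page.close())
    _background.add(closer)
    closer.add_done_callback(_reap)
    raise PageWedged(f"page lifecycle exceeded {wall}s hard wall")
```

The caller catches `PageWedged`, skips the page, and releases its pool slot, so a wedge costs one bounded wait instead of a worker.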

The retry I built, measured, and made opt-in

There's a fifth story, outside the four, and it's where bench discipline earns its keep, because the disciplined move was to not ship.

A factual-lookup run kept aborting at 408 seconds with "0 sources / aborted due to timeout." I first filed it as another fetch wedge. The verbose trace said otherwise: the sources came back fine, and the stall was in synthesis. The model's roughly 8,000-token answer takes about 110–150 seconds to generate, but the non-streaming client had a 120-second whole-call timeout — so it timed out mid-generation and re-ran the entire 8k-token synthesis from scratch, three times, ~360 seconds of work to produce nothing. The retry wasn't resilience. It was a way to fail three times slower.

The real fix was to stream synthesis — even in JSON mode — so the timeout bounds only the connection, not the whole generation: one pass, ~130 seconds, roughly 3× faster, with an idle-token deadline so a genuine stall still fails fast (#106). But a residual remained: about 20% of large generations still stalled mid-stream. The obvious reflex is a retry. I tried it, and reverted it, because the stall persisted across the retry window — same failure, more wall-clock.
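A minimal sketch of an idle-token deadline, assuming the stream is exposed as an async iterator of text chunks. `StreamStalled` and `collect_stream` are hypothetical names, not the client's real API. The deadline applies to the gap between chunks, so a long generation passes as long as tokens keep arriving.

```python
import asyncio


class StreamStalled(Exception):
    """No chunk arrived within the idle deadline."""


async def collect_stream(chunks, idle_timeout):
    """Join streamed chunks, failing fast if the gap between two chunks is too long."""
    parts = []
    iterator = chunks.__aiter__()
    while True:
        try:
            # The timeout restarts for every chunk; total generation time is unbounded.
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout=idle_timeout)
        except StopAsyncIteration:
            break
        except asyncio.TimeoutError:
            raise StreamStalled(
                f"no token for {idle_timeout}s after {len(parts)} chunks"
            ) from None
        parts.append(chunk)
    return "".join(parts)
```

Compared with a whole-call timeout, this fails a genuine stall within one idle window instead of discarding a generation that was still making progress.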

Before blaming the upstream, I had to rule out my own plumbing — the traffic crosses a tunnel to reach the model pool, and a tunnel is a fine suspect. So I stood up a box-resident runner with zero tunnels, sitting on the same internal network right next to the proxy and the search instance, and ran the question five times each way. The stall rate was identical, ~20%, in-network and over-tunnel (#104). That's an A/B that falsifies the convenient hypothesis: it wasn't my tunnel, it was the upstream dropping tokens mid-stream on big generations. So I left the issue open and parked the retry as a documented, opt-in design instead of shipping it on by default.

That's the senior move I keep relearning. A retry that fires on an upstream stall that outlasts the retry window doesn't fix anything; it pays the failure cost again. Knowing when not to ship a retry is worth more than the retry, and the only thing that told me which case I was in was an A/B I built specifically to prove myself wrong.

Why this is the whole point

None of these four were clever models or clever bugs. They were the measurement lying, in four different dialects: a backend throttling my IP and falling back to Wikipedia, an "improvement" that was noise inverting on a stable bench, a prompt line that helped one place and hurt another, and a timeout that wasn't watching the thing that hung. Every one of them, for a while, I attributed to the model. Every one of them was the harness.

So the rule I run by now is boring and load-bearing: fix the harness before you trust the number. Make the substrate hold still — stable egress, deterministic verifier, real deadlines on the calls that actually block — and only then read the score as a fact about the model. A bench you haven't audited isn't evidence. It's a confident narrator that happens to be wrong, and an LLM has enough of those already.

This is what Own Your Stack looks like one layer down from the slogan. If you can't see your own measurement substrate — the egress, the timeouts, the grader — you don't own your evaluation, you're renting someone else's assumptions and calling the readout truth. I'd rather show you the four times the bench fooled me than a leaderboard I can't vouch for, because the four times are the reason I trust the fifth number.

We build and run AI systems where the measurement is as engineered as the model — benches that hold still, gates that can't be sweet-talked, deadlines on the calls that actually hang. If you're trying to tune something and the numbers won't sit still, that's the kind of thing we're good at pinning down.
