You can't use an LLM to grade an LLM (I tried, for money)

I sell a $1,500 code audit. The obvious way to scale it is to have one model find the bugs and a second model check the first one's work. I built that. It does not work, and the way it fails is worth more than if it had.

This is a negative result, written up the same way I'd write a win — because the failure is specific and reproducible, and because almost every "AI reviews your code" pitch I see is quietly built on the exact assumption I'm about to tell you is false.

The product is a code audit. A human pays $1,500, points me at a repo, and gets back a report: real bugs, ranked, each one quoted against the line it lives on, each one something I'd stake my name on. The expensive part isn't finding candidate bugs — a model will generate fifty before lunch. The expensive part is the second pass that throws out the forty-five that are wrong. That's the part I wanted to automate. That's the part that broke.

The setup: a grounded auditor and an independent verifier

The first design was the textbook one. An Auditor agent reads the source and drafts findings. To keep it honest, I didn't let it editorialize — I told it to run sed on the exact lines it was accusing and paste the verbatim output into the finding, so the evidence would be the file itself, not the model's memory of the file.
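For reference, the extraction the Auditor was told to perform is the one `sed -n 'M,Np' file` does: print lines M through N verbatim. A minimal TypeScript sketch of that slice, working on an in-memory string, might look like the following. The helper name `extractLines` and the sample source are hypothetical, not part of the tooling described here.

```typescript
// Hypothetical helper: returns lines start..end (1-indexed, inclusive),
// the same slice that `sed -n 'start,endp'` prints.
function extractLines(source: string, start: number, end: number): string[] {
  if (start < 1 || end < start) {
    throw new RangeError(`invalid line range ${start}-${end}`);
  }
  const lines = source.split("\n");
  // A trailing newline would otherwise show up as a phantom empty last line.
  if (lines[lines.length - 1] === "") lines.pop();
  // slice() stops at the end of the file, as sed does.
  return lines.slice(start - 1, end);
}

// Illustrative input only; not taken from the audited repo.
const sampleSource =
  [
    "for (const item of items) {",
    "  if (!item.ready) {",
    "    continue;",
    "  }",
    "  handle(item);",
    "}",
  ].join("\n") + "\n";

// Lines 2-3 of the sample, exactly as they appear in the "file".
const citedLines = extractLines(sampleSource, 2, 3);
```

The point of the instruction was that `citedLines` comes from the file, not from the model's recollection of it. Whether the model actually pastes that output is a separate question.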

Then a second, independent Verifier agent — different prompt, no shared state — would re-run sed on each cited location and rule the finding verified or rejected. The theory is the one everyone has: a fresh model with the ground truth in hand will catch the first model's mistakes. Two sets of eyes. Defense in depth.

Here is what actually happened.

The auditor fabricated, and forged its own evidence

The Auditor filed five CRITICALs against one file, all the same class: a "systematic missing-continue bug." The claim was that a loop was missing its continue statement and would fall through. Five real line numbers, in real source, in agent.ts.

Every one of those five lines had a perfectly good continue or break sitting right there. Five out of five fabricated. A 100% false-positive rate on a single confident batch.

Worse than the fabrication: it forged the evidence too. I'd told it to paste verbatim sed output as proof, and it produced "Extracted code — verified" blocks that did not match what sed returns. The grounding instruction didn't ground it — the model generated text in the shape of a tool result, complete with the reassuring word "verified," because that's what a finished finding looks like, and looking like one is the only thing a language model optimizes for. And because this was wired into a fleet, those five fabricated CRITICALs auto-paged a downstream agent to "fix" bugs that did not exist. The confabulation didn't just sit in a report; it dispatched labor.

The verifier was supposed to catch exactly this. It didn't.

Fine — that's what the second agent is for. So I watched the Verifier do its job. It really did run sed. It really did get back source lines that plainly read continue;. And it returned, in writing:

5 verified, 0 rejected

It read the ground truth, the ground truth contradicted the finding, and it stamped the finding PASS anyway. That's the whole result in one sentence: the verifier pattern-matches bug-shape to PASS, regardless of what the tool output says. A "missing-continue bug" has the silhouette of a true finding — a real file, a real line, a plausible category, an authoritative tone — and the model grades the silhouette. The bytes from sed are right there in its context, and they lose to the vibe of a well-formed bug report.

Nobody designs around this because it's counterintuitive: we assume a verifier that has the evidence is a verifier that uses it. It isn't. An LLM judge shares the failure class of the thing it's judging — both are producing the most likely-looking token sequence, and "this finding checks out" is an extremely likely-looking thing to say after a finding. Handing it the file doesn't fix that. It just gives the confabulation a more convincing backdrop.

Then I tried to help, and measured it getting worse

Here's the part that's most instructive, and the most counterintuitive. I figured the gate was close — maybe it just needed a hint about what to look for. So I added a heuristic rule to the Auditor's prompt naming the specific pattern of a real empty-statement bug: the bare semicolon, if (cond) ;, the loop body that's accidentally a no-op.

I had a small fixture set I was scoring against. Before the rule, 4 fixtures passed clean. After I named the bare-semicolon pattern, 1 passed. Naming the bug regressed the auditor.

The mechanism is anchoring, and you can watch it happen. The moment I told the model "this is what the bug looks like," it went hunting for that shape and started finding it in code that didn't contain it — manufacturing bare-semicolon bugs the way it had manufactured missing-continue ones, now with more enthusiasm because I'd primed the target. The instruction meant to sharpen its precision blunted it. I reverted the rule.

That number — 4 PASS to 1 PASS from a single "helpful" sentence — is the one I want senior engineers to sit with. Prompt-tuning a judge isn't a gentle slope toward "good enough." You can push a prompt and make the thing measurably worse at the exact job you're tuning it for, because the lever you're pulling (telling it what to attend to) is the same lever that drives the fabrication. There's a ceiling, and it's below "trustworthy."
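One way to keep a "helpful" sentence from shipping is to treat the fixture score as a regression test on the prompt itself. The sketch below is an assumption about how such a check could look, not the scoring harness used here. The type `FixtureOutcome`, the meaning of `passedClean`, the fixture names, and the set size of five are all invented for illustration; only the 4 and the 1 come from the text.

```typescript
// Hypothetical record: one fixture's outcome under one prompt variant.
interface FixtureOutcome {
  fixture: string;
  passedClean: boolean; // assumed meaning: the run produced no fabricated finding
}

function countCleanPasses(outcomes: FixtureOutcome[]): number {
  return outcomes.filter((o) => o.passedClean).length;
}

// Keep a prompt change only if it does not lower the clean-pass count.
function shouldKeepPromptChange(
  before: FixtureOutcome[],
  after: FixtureOutcome[]
): boolean {
  return countCleanPasses(after) >= countCleanPasses(before);
}

// Made-up outcomes shaped to match the reported drop from 4 passes to 1.
// The set size of five and the fixture names are invented.
const fixtureNames = ["f1", "f2", "f3", "f4", "f5"];
const beforeRule = fixtureNames.map((fixture, i) => ({ fixture, passedClean: i < 4 }));
const afterRule = fixtureNames.map((fixture, i) => ({ fixture, passedClean: i < 1 }));

// false: the rule lowers the clean-pass count, so it gets reverted.
const keepRule = shouldKeepPromptChange(beforeRule, afterRule);
```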

What I trust instead: a gate with no model in it

So I stopped trying to make the LLM honest and built something that doesn't need it to be. The verifier is now a small script — no model, no inference, no judgment. For every finding it takes the code the Auditor quoted and does a whole-line byte comparison against the actual source file. If the quoted line isn't in the file, byte for byte, the finding is rejected and the process exits non-zero. That's the entire idea.

It's dumb the way a fuse is dumb, and it can't be convinced. A fabricated "missing-continue bug" arrives with a quoted block that does not match the file, because the file plainly says continue; where the finding says it doesn't. The quote and the source don't reconcile, and the gate kills it, every run, no good days and bad days. On the same finding set that broke the LLM verifier, the byte gate failed the fabricated findings and passed the genuine ones. It caught 100% of those fabrications, because catching a misquote stops being a judgment call once you refuse to let a model make it.
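A minimal sketch of such a gate in TypeScript, under stated assumptions: the shapes `QuotedFinding` and `GateResult` and the function `byteGate` are hypothetical names, the source file is passed in as an already-decoded string, and exact string equality stands in for the byte comparison. It returns an exit code instead of terminating the process, so a caller can decide how to exit.

```typescript
// Hypothetical finding shape: an id plus the lines the auditor quoted.
interface QuotedFinding {
  id: string;
  quotedLines: string[]; // code the auditor claims to have copied verbatim
}

interface GateResult {
  accepted: QuotedFinding[];
  rejected: QuotedFinding[];
  exitCode: number; // non-zero when anything was rejected
}

// Whole-line, exact comparison: every quoted line must equal some source
// line exactly. No trimming, no case folding, no fuzzy matching.
function byteGate(source: string, findings: QuotedFinding[]): GateResult {
  const sourceLines = new Set(source.split("\n"));
  const accepted: QuotedFinding[] = [];
  const rejected: QuotedFinding[] = [];
  for (const finding of findings) {
    // A finding with no quote at all has no evidence, so it is rejected too.
    const grounded =
      finding.quotedLines.length > 0 &&
      finding.quotedLines.every((line) => sourceLines.has(line));
    if (grounded) {
      accepted.push(finding);
    } else {
      rejected.push(finding);
    }
  }
  return { accepted, rejected, exitCode: rejected.length > 0 ? 1 : 0 };
}

// Illustrative source and findings; not taken from the audited repo.
const gateSource = "while (busy) {\n  if (skip) {\n    continue;\n  }\n}";
const gateResult = byteGate(gateSource, [
  { id: "real-quote", quotedLines: ["    continue;"] },
  { id: "forged-quote", quotedLines: ["    ;"] },
  { id: "reindented-quote", quotedLines: ["continue;"] },
]);
// gateResult.accepted holds only "real-quote"; gateResult.exitCode is 1.
```

Note that the re-indented quote is rejected as well: the strictness is deliberate, since any tolerance is a place for a paraphrase to hide. This check only proves that the quoted lines exist in the file. Whether the claim made about them holds is left to the human sign-off stage.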

The shape of the architecture I actually run is three boxes, and the order matters:

AI drafts findings   →  deterministic gate culls fabrications  →  human signs off
(generative, cheap,     (no LLM, byte-exact, no                  (judgment, accountable,
 confabulates freely)    opinions, can't be talked to)            the part you pay for)

The model does what it's good at: generating plausible candidates fast. The gate does what models are bad at: refusing the ones that don't survive contact with the file. A human does the irreducible part — deciding what's worth reporting and putting a name on it. The mistake was ever asking the generative layer to also be the honest layer. Those are different jobs, and one of them can't be done by something whose objective is plausibility.
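The ordering can be made structural rather than a convention. The following is a rough sketch, assuming the drafting model is hidden behind a function and stubbed out here; every name in it (`DraftFinding`, `Drafter`, `DeterministicGate`, `findingsForHumanReview`) is hypothetical, and only the three-stage order comes from the text.

```typescript
// All names are hypothetical; only the three-stage order comes from the text.
interface DraftFinding {
  id: string;
  quotedLines: string[]; // code the drafting model claims is in the file
}

type Drafter = (source: string) => DraftFinding[];
type DeterministicGate = (source: string, finding: DraftFinding) => boolean;

// Stage 1 drafts, stage 2 culls. Whatever is returned is queued for stage 3,
// the human reviewer; there is no code path that skips the gate.
function findingsForHumanReview(
  source: string,
  draft: Drafter,
  gate: DeterministicGate
): DraftFinding[] {
  return draft(source).filter((finding) => gate(source, finding));
}

// Stand-in for the model: returns canned drafts, one of them forged.
const stubDrafter: Drafter = () => [
  { id: "grounded", quotedLines: ["    continue;"] },
  { id: "forged", quotedLines: ["    ;"] },
];

// Exact whole-line membership, with no model involved.
const exactLineGate: DeterministicGate = (source, finding) => {
  const lines = new Set(source.split("\n"));
  return (
    finding.quotedLines.length > 0 &&
    finding.quotedLines.every((quoted) => lines.has(quoted))
  );
};

const reviewQueue = findingsForHumanReview(
  "while (busy) {\n    continue;\n}",
  stubDrafter,
  exactLineGate
);
// reviewQueue holds only the "grounded" finding.
```

Because the gate's type signature takes no model and returns a plain boolean, swapping in an LLM judge would be a visible change to the interface rather than a quiet prompt edit.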

This isn't one weird repo. It's a conviction I keep acting on.

This isn't a single anecdote I'm overfitting. The same belief shows up everywhere I build, usually as a deliberate decision not to reach for an LLM where it's the obvious move. In deepdive, my research tool, every [N] citation in a synthesized answer gets checked against the source it points to. The natural way to build that in 2026 is to ask a model "does source 3 support this sentence?" I refused to, on purpose. The citation verifier is a pure lexical recall check (tokenize, drop stop-words, score overlap), precisely because an LLM judge would reintroduce the exact hallucination class the verifier exists to catch. You do not audit confabulation with a confabulator. The lexical check is lossy the other way (it can flag a truthful paraphrase), but it can't invent agreement that isn't there, and that's the property I need.
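The three steps named above (tokenize, drop stop-words, score overlap) can be sketched in a few lines of TypeScript. This is an illustration of the technique, not deepdive's code: the stop-word list, the tokenizer regex, and the names `contentTokens` and `lexicalRecall` are assumptions, and any pass/fail threshold applied to the score would be another one.

```typescript
// Hypothetical stop-word list; a real one would be much longer.
const STOP_WORDS = new Set([
  "a", "an", "the", "of", "to", "in", "is", "are", "and", "or", "that", "it",
]);

// Lowercase, split on anything that is not a letter or digit, drop stop-words.
function contentTokens(text: string): string[] {
  const words: string[] = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return words.filter((word) => !STOP_WORDS.has(word));
}

// Recall: the fraction of the claim's distinct content tokens that also
// appear in the cited source. No stemming, no synonyms, no model.
function lexicalRecall(claim: string, sourceText: string): number {
  const claimTokens = Array.from(new Set(contentTokens(claim)));
  if (claimTokens.length === 0) return 0;
  const sourceTokens = new Set(contentTokens(sourceText));
  const hits = claimTokens.filter((token) => sourceTokens.has(token)).length;
  return hits / claimTokens.length;
}

// Illustrative sentences, not real deepdive data.
const citedSource = "The gate rejects any finding whose quoted line is not in the file.";
const supportedClaim = "The gate rejects findings that misquote the file";
const unsupportedClaim = "The proxy encrypts every payload";

const supportedScore = lexicalRecall(supportedClaim, citedSource); // 3 of 5 tokens: 0.6
const unsupportedScore = lexicalRecall(unsupportedClaim, citedSource); // 0 of 4 tokens: 0
```

The supported claim scores only 0.6 because "findings" does not match "finding" and "misquote" has no counterpart in the source. That is the lossy direction: a fair paraphrase can be under-scored, but a claim with no overlap cannot score above zero.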

In claude-sync, an automated audit handed me five "P0" empty-statement bugs — if (cond) ; — at six specific line numbers across three files. Every cited location used continue correctly, and a sweep for the pattern across the whole source returned zero matches. I did not fix them, because there was nothing to fix. I wrote them up in the pull request as confabulations, with the proof, and left them unactioned. Treating model output as adversarial-until-verified isn't cynicism; it's the only stance that survives the thing being wrong five times in a row with total confidence.
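A sweep like the one described can itself be deterministic. Below is a heuristic sketch, not the command actually run: it scans each line for an `if (...)` whose condition is followed directly by a semicolon, matching parentheses by depth so that `if (a) return foo(b);` is not flagged. The function name `findEmptyIfStatements` is hypothetical, and the scan ignores strings, comments, and conditions that span lines.

```typescript
// Returns 1-indexed line numbers where an `if (...)` condition is followed
// directly by `;`. Heuristic: line-at-a-time, unaware of strings and comments.
function findEmptyIfStatements(source: string): number[] {
  const hits: number[] = [];
  source.split("\n").forEach((line, index) => {
    const ifPattern = /\bif\s*\(/g;
    let match: RegExpExecArray | null;
    while ((match = ifPattern.exec(line)) !== null) {
      // Walk forward from the opening parenthesis to its matching close.
      let depth = 1;
      let i = match.index + match[0].length;
      while (i < line.length && depth > 0) {
        if (line[i] === "(") depth++;
        else if (line[i] === ")") depth--;
        i++;
      }
      if (depth !== 0) continue; // condition spans lines; out of scope here
      // A bare semicolon right after the condition is an empty statement.
      if (/^\s*;/.test(line.slice(i))) {
        hits.push(index + 1);
        break; // one report per line is enough
      }
    }
  });
  return hits;
}

// Illustrative input: lines 1, 3 and 5 contain the pattern, 2 and 4 do not.
const sweepSample = [
  "if (a) ;",
  "if (a) return foo(b);",
  "if (check(a, (b))) ;",
  "  if (!item.ready) continue;",
  "if (x);",
].join("\n");

const emptyIfLines = findEmptyIfStatements(sweepSample); // [1, 3, 5]
```

An empty result from a sweep like this is the kind of proof that can go in a pull request: anyone can rerun it and get the same answer.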

And a related thread that's almost funny in hindsight. In dario, my LLM proxy, a name-scrubber meant to mask competitor framework names was running over message content — so the JavaScript keyword continue; in code passing through the proxy got rewritten to a bare ; in transit. A downstream auditor "found" the bare-semicolon bug and reported it as a model hallucination. It wasn't the model: the proxy had corrupted the payload, and the audit blamed the LLM for a data-integrity bug one layer down. A live A/B nailed it — the audit fabricated the finding 3 of 3 times through the broken proxy and 0 of 4 through the fixed one. Even when an LLM looks like it's hallucinating, checking the actual bytes is what tells you whether it really is.
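The failure mode is easy to reproduce in miniature. The sketch below is a guess at the mechanism, not dario's code: it assumes the scrubber deleted a listed name wherever it appeared, case-insensitively, and that "continue" was on its list. Neither detail is stated in the text. The names `SCRUBBED_NAMES`, `naiveScrub`, and `payloadIntact` are hypothetical.

```typescript
// Hypothetical reproduction; the real scrubber's word list and matching
// rules are not given in the text.
const SCRUBBED_NAMES = ["continue"]; // assumed: a product name that collides with a keyword

// Deletes every listed name wherever it occurs, including inside code.
function naiveScrub(content: string): string {
  return SCRUBBED_NAMES.reduce(
    (text, word) => text.replace(new RegExp(word, "gi"), ""),
    content
  );
}

// Deterministic transit check: what arrives must equal what was sent.
function payloadIntact(sent: string, received: string): boolean {
  return sent === received;
}

const sentLine = "    continue;";
const receivedLine = naiveScrub(sentLine); // "    ;"
const intact = payloadIntact(sentLine, receivedLine); // false
```

Comparing the payload on both sides of the proxy separates "the model invented this" from "the model accurately reported corrupted input", which is the distinction the A/B run established.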

Why this is the whole point

"Use an LLM to check the LLM" is the default reflex right now, and it's load-bearing in a lot of products charging you money. It feels like rigor. It isn't. A second model is not an independent check when it fails the same way as the first — it's a more expensive way to get the same confident wrong answer, now with a quorum.

The deterministic gate is the unglamorous half of the system, and it's the half that makes the other half sellable. I can let a model draft as recklessly as it wants — fabricate, anchor, forge its little "verified" blocks — because nothing it produces reaches a client until it survives a comparison against the file that can't be sweet-talked. That's what lets me put a price and my name on the output of a thing I know to be an unreliable narrator. This is what Own Your Stack means at the engineering level: own the part of the pipeline that's allowed to say no, and never let it be a model.

The capability is the commodity. The gate is the product. I'd rather show you the five fabricated CRITICALs than a clean demo, because the five fabricated CRITICALs are the reason the demo is honest.

You can't use an LLM to grade an LLM. I tried, for money.