Citation verification isn't enough: a perfect score on content farms

Every serious AI research tool now verifies its citations — re-reads the cited page and confirms the claim is really there. Mine does too. Then it gave a flawless verification score to an answer where every single source was AI-generated spam. The verifier wasn't broken. It was answering the wrong question.

This is a negative result about my own tool, written up the way I'd write a win, because the failure is specific, reproducible, and quietly baked into almost every "AI that researches for you" product shipping right now. It's the companion to "You can't use an LLM to grade an LLM": the same family of mistake (verification that looks like rigor but measures the wrong thing) on a different axis.

The reflex that feels like rigor

The default test for whether an AI answer is trustworthy has collapsed into one word: cited. Show your sources and you're doing research; don't and you're hallucinating. So every serious research tool now shows sources, and the better ones go further and verify them — re-read the cited page, confirm the claim is actually on it, flag the ones that aren't.

deepdive, my local research agent, does this with a deliberately non-LLM verifier — a lexical matcher, for the same reason laid out in the companion post: an LLM judge would reintroduce the exact hallucination it's meant to catch. It nails the classic failure: a confident sentence with a [3] after it where page 3 says no such thing. When the matcher is happy, it prints a clean bill of health.
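To make "lexical matcher" concrete, here is a minimal sketch of one way such a check could work: tokenize the claim, tokenize the cited page, and call the claim supported when enough of its tokens appear on the page. This is not deepdive's actual code. The names `Claim`, `tokenize`, `isSupported`, and `citationSupport`, the 0.8 threshold, and the example text are all assumptions made for illustration.

```typescript
// Hypothetical sketch; none of these names come from deepdive itself.
interface Claim {
  text: string;        // one sentence from the generated answer
  sourceIndex: number; // 0-based index of the page its [N] marker cites
}

// Lowercase, split on anything that is not a letter or digit,
// and drop tokens of one or two characters as a crude stopword filter.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 2);
}

// A claim counts as supported when at least `threshold` of its tokens
// appear somewhere on the cited page. The 0.8 default is an assumption.
function isSupported(claim: Claim, pages: string[], threshold = 0.8): boolean {
  const page = pages[claim.sourceIndex];
  if (page === undefined) return false; // marker points at a page that was never fetched
  const claimTokens = tokenize(claim.text);
  if (claimTokens.length === 0) return false;
  const pageTokens = new Set(tokenize(page));
  const hits = claimTokens.filter((t) => pageTokens.has(t)).length;
  return hits / claimTokens.length >= threshold;
}

// Fraction of claims that pass; an answer with no claims scores 0 here.
function citationSupport(claims: Claim[], pages: string[]): number {
  if (claims.length === 0) return 0;
  const supported = claims.filter((c) => isSupported(c, pages)).length;
  return supported / claims.length;
}

// Invented example text: the page could be a content farm and still match.
const examplePages = [
  "Model Alpha scores ninety points on the example benchmark, beating Model Beta.",
];
const exampleClaims: Claim[] = [
  { text: "Model Alpha scores ninety points on the example benchmark", sourceIndex: 0 },
];
console.log(`citation support: ${citationSupport(exampleClaims, examplePages).toFixed(2)}`);
// prints "citation support: 1.00"
```

Nothing in this check looks at where the page came from. It only compares two strings, which is the limitation the rest of this post is about.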

Here is the problem in one line: citation verification answers "does the answer match its source?" and stops there. It says nothing about whether that source deserved to be believed in the first place. Those are two different questions, and on a large class of real queries they give opposite answers.

Two axes, and the day they came apart

I dogfooded deepdive on eight diverse real questions. The synthesis — reading the pages and writing the report — was genuinely strong: seven of eight graded A. The failure wasn't in the writing. It was upstream, in which sources got picked.

The question was "what are the latest open-weight LLMs and how do they compare?" — a recency query. Recency is where the web is worst: the authoritative pages (a lab's own release post, a benchmark with a stated methodology) lag days behind a swarm of SEO-optimized, AI-generated "TOP 10 NEW AI MODELS" pages that exist to capture exactly that search. deepdive took what search ranked. What search ranked was, in order: aiflashreport.com, aireleasetracker.com, gpt0x.com, lmmarketcap.com — every one a content farm.

The agent read them, synthesized a confident comparison full of specific benchmark numbers — numbers that are trivially fabricable precisely because no primary source was ever consulted — and the citation verifier checked each claim against its farm page, found the text matched, and reported:

citation support: 1.00

A perfect score. The same score a federalreserve.gov-grounded answer gets. To the person reading the report, the two are visually identical: confident prose, neat [N] markers, a clean verification footer. One is grounded in the primary record; one is a hall of mirrors. The trust signal couldn't tell them apart, because it was never measuring the thing that differed.

That's the whole finding in a sentence: a lexical citation check proves the source said it — not that the source is credible. Verifying harder doesn't help. You can verify a citation to a content farm to 1.00 all day. The verifier isn't wrong; it's orthogonal.

The fix almost everyone reaches for is the wrong one

The instinct is immediate: add a step that judges whether a source is trustworthy. And in 2026 the instinct for how to do it is equally immediate: ask a model. Hand an LLM the URL and the page and have it rate credibility.

The companion post is the long version of why that's a trap; the short version is that a model asked "is this source credible?" pattern-matches the shape of credibility — authoritative tone, dense citations, a confident layout — which is exactly what a good content farm is optimized to fake. You'd be grading farm spam with a judge the farm was built to fool, and paying per token to do it. A second model that fails the same way as the first isn't a check. It's a quorum on the wrong answer.

So the source-authority layer I added to deepdive has no model in it. It scores a source's authority from its domain alone — pure, deterministic, no network, no inference — the same philosophy as the citation verifier it sits beside:

scoreAuthority(url) → { tier: "primary" | "reputable" | "unknown" | "low", score, reason }

The design is boost-led, because that's where you can be confident. Government, education, and standards/academia domains (.gov, .edu, arxiv.org, ietf.org, w3.org) and official docs (redis.io, kubernetes.io, the Rust and AWS docs) score primary — false positives there are near zero. Wikipedia, Stack Overflow, and GitHub score reputable. Anything unrecognized stays unknown and neutral — never punished, because a niche-but-legit blog shouldn't lose for being unfamous. And a small, conservative, hand-curated denylist of content farms actually seen in the wild scores low. Precision over recall: a missed farm is acceptable, a misflagged real source is not.
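A minimal sketch of a domain-only scorer in that shape, under stated assumptions: the numeric scores, the helper `matchesDomain`, and the exact list contents are illustrative, not deepdive's real tables. The denylist here just reuses the four farms named earlier in this post, and the Rust and AWS doc hosts are left out because the post does not name them.

```typescript
type Tier = "primary" | "reputable" | "unknown" | "low";
interface Authority { tier: Tier; score: number; reason: string }

// Illustrative lists only; a real table would be longer and hand-curated.
const PRIMARY_SUFFIXES = [".gov", ".edu"];
const PRIMARY_DOMAINS = ["arxiv.org", "ietf.org", "w3.org", "redis.io", "kubernetes.io"];
const REPUTABLE_DOMAINS = ["wikipedia.org", "stackoverflow.com", "github.com"];
const DENYLIST = ["aiflashreport.com", "aireleasetracker.com", "gpt0x.com", "lmmarketcap.com"];

// True for the domain itself or any subdomain, but not for lookalikes
// such as "notarxiv.org".
function matchesDomain(host: string, domain: string): boolean {
  return host === domain || host.endsWith("." + domain);
}

// Pure and deterministic: no network, no model. Score values are assumptions.
function scoreAuthority(url: string): Authority {
  let host: string;
  try {
    host = new URL(url).hostname.toLowerCase();
  } catch {
    return { tier: "unknown", score: 0.5, reason: "unparseable URL" };
  }
  if (DENYLIST.some((d) => matchesDomain(host, d))) {
    return { tier: "low", score: 0.0, reason: `denylisted content farm: ${host}` };
  }
  if (PRIMARY_SUFFIXES.some((s) => host.endsWith(s)) ||
      PRIMARY_DOMAINS.some((d) => matchesDomain(host, d))) {
    return { tier: "primary", score: 1.0, reason: `primary domain: ${host}` };
  }
  if (REPUTABLE_DOMAINS.some((d) => matchesDomain(host, d))) {
    return { tier: "reputable", score: 0.7, reason: `reputable domain: ${host}` };
  }
  // Unrecognized stays neutral: never punished for being unfamous.
  return { tier: "unknown", score: 0.5, reason: `unrecognized domain: ${host}` };
}

console.log(scoreAuthority("https://www.federalreserve.gov/").tier); // prints "primary"
console.log(scoreAuthority("https://aiflashreport.com/").tier);      // prints "low"
```

The suffix check is deliberately naive: it would miss government domains outside the bare `.gov` suffix (for example under a country-code TLD). That fits the precision-over-recall stance, since a miss falls back to neutral rather than to a wrong label.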

It does two things with that score. First, at the moment deepdive decides which candidates to spend its limited fetch slots on, it ranks authoritative sources ahead of whatever SEO ranked first. Second, and more important, it reports the axis it had been hiding. The same run that scored 1.00 on citation support now also prints, right beneath it:

## Citation health

⚠ Source trust: low — of 6 source(s), 0 primary/reputable,
0 unrecognized, 6 low-authority (content farms). Distinct from
citation support: this rates whether the sources themselves are credible.

Two numbers, side by side, that disagree — and that's the point. citation support: 1.00 and source trust: low together are the honest read: yes, the answer faithfully reflects its sources; no, you should not trust its sources. Neither number alone tells the truth. The bug was ever showing only the first one.
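Both uses of the score can be sketched in a few lines, assuming each candidate already carries a tier. The names `Candidate`, `pickForFetch`, and `summarizeTrust` are hypothetical. The "high" rule follows the definition given later in this post (no known farms and a primary-source majority), while the "low" threshold and the "mixed" middle level are assumptions of this sketch.

```typescript
type Tier = "primary" | "reputable" | "unknown" | "low";
interface Candidate { url: string; searchRank: number; tier: Tier }

const TIER_ORDER: Record<Tier, number> = { primary: 0, reputable: 1, unknown: 2, low: 3 };

// Spend limited fetch slots on the most authoritative candidates first,
// falling back to the search engine's own order within a tier.
function pickForFetch(candidates: Candidate[], slots: number): Candidate[] {
  return [...candidates]
    .sort((a, b) => TIER_ORDER[a.tier] - TIER_ORDER[b.tier] || a.searchRank - b.searchRank)
    .slice(0, slots);
}

type TrustLevel = "high" | "mixed" | "low";

// Aggregate per-source tiers into one report line.
// "low" when farms are a majority and "mixed" otherwise are assumed rules.
function summarizeTrust(tiers: Tier[]): { level: TrustLevel; line: string } {
  const count = (t: Tier) => tiers.filter((x) => x === t).length;
  const primary = count("primary");
  const good = primary + count("reputable");
  const unknown = count("unknown");
  const low = count("low");
  let level: TrustLevel = "mixed";
  if (low === 0 && primary * 2 > tiers.length) level = "high";
  else if (low * 2 > tiers.length) level = "low";
  const line =
    `Source trust: ${level} (of ${tiers.length} source(s), ${good} primary/reputable, ` +
    `${unknown} unrecognized, ${low} low-authority)`;
  return { level, line };
}

// The run described above: six sources, all content farms.
const allFarms: Tier[] = ["low", "low", "low", "low", "low", "low"];
console.log(summarizeTrust(allFarms).line);
// prints "Source trust: low (of 6 source(s), 0 primary/reputable, 0 unrecognized, 6 low-authority)"
```

Keeping the summary separate from the citation-support number is the design choice that matters here: the two values are computed from different inputs and are allowed to disagree.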

The rule this leaves

A citation has two independent failure modes. The claim can fail to match the source, or the source can fail to deserve belief. Verifying the first and reporting it as "verified" is how a tool launders content-farm spam into a confident, green-checkmarked answer.

This generalizes past my CLI because every tool in the category — the hosted ones included — leans on "cited" as the trust signal, and "cited" measures support, not authority. The next time a research tool hands you a crisp, fully-sourced answer to a trending question, the question to ask isn't "did it cite its sources?" It's "did it cite anything that was actually worth citing?" — and almost none of them will show you that number, because almost none of them compute it.

What this doesn't do (the honest part)

Domain-level scoring is a floor, not a judgment: it can't tell a careful arXiv preprint from a sloppy one, or catch a primary source that's simply wrong. The denylist is curated and small, so "trust: high" means "no known farms and a primary-source majority," not a guarantee. And the aggregate before/after across a full question set — what fraction of sources move from low/unknown to primary/reputable once you prefer authority — is a measurement I'm finishing wiring into the bench, and I'll report it with the numbers whether they flatter the feature or not. The single failure above is deterministic and reproducible today; the rest gets the same numbers-first treatment that made me reject my own synthesis-prompt "improvement" the day its metric came back negative.

Why this is the point

"Cited" is doing a lot of unearned work as a trust signal right now, and it's load-bearing in products charging you money. A citation verifier — even a rigorous, non-LLM one — only proves the answer is faithful to its sources. Whether those sources deserved your faith is a second question, and on exactly the queries where it matters most, the two answers diverge.

The fix isn't a smarter judge. It's the humility to compute both numbers and let them disagree out loud. That's what Own Your Stack means at this layer: own the part of the pipeline that's allowed to tell you a source is junk, keep it deterministic so it can't be sweet-talked, and put it in front of you instead of hiding it behind a checkmark. deepdive runs on your machine, routes through your own model, and the scorer, the ranking, and the trust signal are about two hundred lines you can read end to end — not one of which asks a model anything.
