deepdive: a source-trust engine for a local research agent

A local research agent is only as good as the sources it trusts, and search will rank an anyone-can-publish Google Doc right next to a peer-reviewed paper. Here's a measured, two-axis source-trust engine, built as a cleanly decomposed epic, including the real bug that scored a fabricable doc as authoritative and how it was found, measured, and fixed.

The engineering problem

A local research agent that fetches, reads, and synthesizes web pages is only as useful as the sources it trusts. Search results are not curated by authority — a peer-reviewed paper, a government dataset, a content-farm article, and an arbitrary Google Doc can all land in the same results page. If the agent treats them identically, it confidently synthesizes junk alongside evidence. If it just defers to search ranking, it inherits the ranking's own biases.

The naive answer is to filter by domain allowlist. That scales poorly, is static, and misses the interesting case: an unknown domain that still has structural markers of authority (an academic subdomain, a .gov TLD, a recognized preprint host). You want a scorer, not a list.

The harder problem: you can define a scorer, but how do you know it's working? Heuristic code that "feels right" accumulates blind spots silently. You need a benchmark that can show you the distribution of your trust scores across real queries — so you can see whether primary is being given out too freely, or whether the scorer degrades on edge cases you didn't anticipate.

That is the engineering problem deepdive's source-trust engine was built to solve.

What deepdive does

@askalf/deepdive is an open-source local research agent: ask a question, it plans a search strategy, fans out across multiple queries, fetches pages via a headless browser, extracts relevant content, and synthesizes a cited answer — entirely on your own machine, routing every LLM call through your own endpoint. Zero hosted dependencies.

This case study is about one specific engineering investment: the source-authority / trust-scoring engine, built as a clean, decomposed, benchmark-backed epic across eight pull requests.

The source-trust engine: two axes, not one

Before Epic #111, deepdive's keep/discard logic was relevance-only: does this page answer the question? That is necessary, but not sufficient. A relevance-only agent will keep a highly-relevant page from a fabricable source (anyone can make a Google Doc), score it identically to a peer-reviewed paper, and cite both with equal confidence.

Epic #111 — "source authority, a second trust axis" — added a perpendicular signal: who published this, and how much should we trust that publisher? The result is a two-axis keep decision: relevant and authoritative. Sources that are relevant but low-authority are downweighted or flagged; sources that are authoritative and primary are promoted in the multi-search fan-out.
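The two-axis decision above can be sketched roughly as follows. This is a minimal illustration, not deepdive's actual code: the tier names come from this article, while the ScoredSource shape, the decideKeep function, and the four decision labels are hypothetical names chosen for the example.

```typescript
// Tier names are from the article; every other name here is an assumption.
type TrustTier = "primary" | "secondary" | "unknown" | "low";
type KeepDecision = "promote" | "keep" | "flag" | "discard";

interface ScoredSource {
  url: string;
  relevant: boolean; // axis 1: does the page answer the question?
  tier: TrustTier;   // axis 2: how much do we trust the publisher?
}

function decideKeep(source: ScoredSource): KeepDecision {
  // Relevance stays a hard gate: an authoritative but off-topic page is no use.
  if (!source.relevant) return "discard";

  switch (source.tier) {
    case "primary":
      return "promote"; // relevant and authoritative
    case "low":
      return "flag";    // relevant but low-authority: downweight or flag
    default:
      return "keep";    // secondary and unknown pass through unchanged
  }
}
```

The point of the sketch is that neither axis can substitute for the other: a primary source that is off-topic is still discarded, and a relevant source from a low-authority publisher is kept only with a flag.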

Building it in four stages (the decomposed epic)

The epic was decomposed into four phases, each merged as a discrete PR. This matters: it means the work is reviewable at each stage, the history is auditable, and no phase silently carries the complexity of all four.

P1 — Foundation: the scorer itself

PR #112 (2026-06-15) introduced src/source-authority.ts and src/domain-filter.ts: the core scoring function that maps a URL to a trust tier (primary, secondary, unknown, low). PR #113 wired the scorer into the keep stage, so authority ranking shapes which sources survive to the synthesis step.
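To make "maps a URL to a trust tier" concrete, here is a small sketch of what such a scorer could look like. The contents of src/source-authority.ts are not reproduced in this piece, so the function name scoreAuthority, the host lists, and the rule order are all assumptions; only the four tier names and the kinds of structural markers (.gov TLD, academic domain, preprint host) come from the text.

```typescript
type TrustTier = "primary" | "secondary" | "unknown" | "low";

// Illustrative lists only; the real rule tables are not shown in this article.
const PREPRINT_HOSTS = new Set(["arxiv.org", "biorxiv.org"]);
const SECONDARY_HOSTS = new Set(["en.wikipedia.org"]);
const LOW_HOSTS = new Set(["content-farm.example"]);

function scoreAuthority(rawUrl: string): TrustTier {
  let host: string;
  try {
    // Normalize: lowercase and drop a leading "www." before matching.
    host = new URL(rawUrl).hostname.toLowerCase().replace(/^www\./, "");
  } catch {
    return "unknown"; // unparseable URL: no evidence either way
  }

  if (LOW_HOSTS.has(host)) return "low";
  // Structural markers of authority on otherwise unknown domains.
  if (host.endsWith(".gov") || host.endsWith(".edu")) return "primary";
  if (PREPRINT_HOSTS.has(host)) return "primary";
  if (SECONDARY_HOSTS.has(host)) return "secondary";
  return "unknown"; // neutral default, not a penalty
}
```

Under these assumed rules, scoreAuthority("https://www.nasa.gov/data") returns "primary" and a domain that matches no rule returns "unknown", which is what separates a scorer from an allowlist: an unlisted domain can still earn a tier from its structure.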

P2 — Transparency: surface the signal

PR #114 made the trust tier visible in output and the --json flag, so users (and downstream tools) can see why a source was kept or dropped — not just that it was. Trust scoring that happens invisibly is trust scoring that cannot be audited.

PR #115 (2026-06-18) released the core engine as v0.26.0: source authority as a second trust axis.

P3 — Measurement: the benchmark

PR #116 added a benchmark test (test/bench-score.test.mjs) that measures the distribution of trust scores across a realistic query sample: how many sources score primary vs. secondary vs. unknown vs. low? This is not a correctness test; it is a calibration tool. If 80% of your sources score primary, the scorer is too generous. The benchmark makes that visible.
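A calibration check of this kind can be sketched in a few lines. This is not the content of test/bench-score.test.mjs; the tierShares and primaryLooksTooGenerous helpers and the MAX_PRIMARY_SHARE ceiling of 0.5 are hypothetical, picked only to show the shape of the measurement.

```typescript
type TrustTier = "primary" | "secondary" | "unknown" | "low";

const TIERS: TrustTier[] = ["primary", "secondary", "unknown", "low"];

// Fraction of the sample that landed in each tier.
function tierShares(tiers: TrustTier[]): Record<TrustTier, number> {
  const shares: Record<TrustTier, number> = { primary: 0, secondary: 0, unknown: 0, low: 0 };
  if (tiers.length === 0) return shares; // empty sample: all shares stay 0
  for (const tier of tiers) shares[tier] += 1;               // count
  for (const tier of TIERS) shares[tier] /= tiers.length;    // normalize
  return shares;
}

// Hypothetical ceiling: the article does not state a threshold.
const MAX_PRIMARY_SHARE = 0.5;

function primaryLooksTooGenerous(tiers: TrustTier[]): boolean {
  return tierShares(tiers).primary > MAX_PRIMARY_SHARE;
}
```

For a sample of three primary sources and one unknown, the primary share is 0.75 and the check trips. The output is a distribution to inspect, not a pass/fail verdict on any single URL.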

P4 — Application: close the feedback loop

PR #120 (2026-06-24) used the authority signal actively: the multi-search fan-out now biases toward primary sources, preferring queries and sources that have already demonstrated authority in the session. The trust score stops being just a label and becomes an input to fetch strategy.
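One simple way to turn the tier into an input to fetch strategy is to reorder the fetch queue by tier. The sketch below is a deliberate simplification of the biasing described above and assumes nothing about how PR #120 implements it; Candidate, TIER_RANK, and orderForFetch are hypothetical names.

```typescript
type TrustTier = "primary" | "secondary" | "unknown" | "low";

interface Candidate {
  url: string;
  tier: TrustTier;
}

// Lower rank is fetched earlier.
const TIER_RANK: Record<TrustTier, number> = { primary: 0, secondary: 1, unknown: 2, low: 3 };

function orderForFetch(candidates: Candidate[]): Candidate[] {
  // Copy first so the caller's array is untouched. Array.prototype.sort is
  // stable (ES2019+), so the original search order is kept within each tier.
  return [...candidates].sort((a, b) => TIER_RANK[a.tier] - TIER_RANK[b.tier]);
}
```

Because the sort is stable, search ranking still breaks ties inside a tier, so the authority signal biases the fan-out without discarding the relevance signal it started from.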

PR #124 (2026-06-25) surfaced the trust badge in the HTML report output, making authority visible at the citation level in the final deliverable.

What a real bug looks like in this system

Even a well-designed trust engine has a known failure mode: pattern-matching heuristics that work for the common case but break on a structurally similar edge case. The docs. prefix rule ("if a hostname starts with docs., it is official product documentation") is a reasonable heuristic. docs.djangoproject.com, docs.aws.amazon.com, and docs.python.org all deserve the primary tier.

The blind spot: docs.google.com also matches — and docs.google.com is the Google Docs app, where anyone can publish a document in thirty seconds. So an arbitrary, user-published Google Doc was being scored at the highest trust tier, as authoritative as a government dataset or a preprint archive. That is precisely the fabricable-content-scored-as-trustworthy failure the engine exists to prevent, leaking in through the trust rule's own pattern.

PR #125 fixed it: a small, auditable DOCS_PREFIX_EXCLUSIONS set gates the docs. boost, so excluded hosts fall through to unknown (neutral, not penalized — just not trusted as authoritative). Google's real product documentation (developers.google.com, cloud.google.com/docs) is unaffected.

Source                                               Before          After
docs.google.com/document/d/.../pub (user-published)  primary (top)   unknown (neutral)
developers.google.com/... (Google's real docs)       primary         primary (unchanged)
cloud.google.com/docs (Google's real docs)           primary         primary (unchanged)
docs.djangoproject.com, docs.aws.amazon.com          primary         primary (unchanged)

+29 lines, −1 line, pinned by a new regression test. CI: all checks green (Node 20 + 22, actionlint, CodeQL). The fix is in the tests and the logic, traceable end to end. Both the HTML badge (#124) and this fix (#125) ship in the current published release, v0.26.1.
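The gate itself is small enough to sketch. DOCS_PREFIX_EXCLUSIONS is the name used in the PR as described here; the function name scoreDocsPrefix, the set's exact membership beyond docs.google.com, and the reduced two-tier return type are assumptions made for illustration. The sketch covers only the docs. prefix rule, so the other rules that keep developers.google.com and cloud.google.com/docs at primary are out of scope.

```typescript
// Only the two tiers this single rule can produce; the full scorer has four.
type DocsRuleTier = "primary" | "unknown";

// Hosts that match the docs. prefix but host user-published content.
// Membership beyond docs.google.com is not stated in the article.
const DOCS_PREFIX_EXCLUSIONS = new Set<string>(["docs.google.com"]);

function scoreDocsPrefix(rawUrl: string): DocsRuleTier {
  const host = new URL(rawUrl).hostname.toLowerCase();
  // The boost applies only when the prefix matches AND the host is not excluded.
  if (host.startsWith("docs.") && !DOCS_PREFIX_EXCLUSIONS.has(host)) {
    return "primary";
  }
  return "unknown"; // fall through: neutral, not penalized
}

// Regression-style checks mirroring the before/after table.
console.assert(scoreDocsPrefix("https://docs.google.com/document/d/x/pub") === "unknown");
console.assert(scoreDocsPrefix("https://docs.python.org/3/") === "primary");
```

Keeping the exclusions in a named set rather than inline conditions is what makes the fix auditable: adding the next user-content host that happens to live under a docs. subdomain is a one-line change with an obvious place for its regression test.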

This is the kind of bug a trust engine accumulates silently. The benchmark (P3) makes the symptom measurable — an unexpectedly high primary rate is a diagnostic signal. The decomposed structure (P1–P4) makes the scorer easy to audit and patch without untangling a monolith.

What this demonstrates

A trust problem, not just a retrieval problem. The research agent category is crowded. The differentiating engineering investment here is the decision to build a measured, multi-axis trust system rather than stopping at relevance scoring. That decision has a traceable history: an issue, a decomposed epic, a benchmark, a P4 application loop, and a real bug found and fixed in the trust rules themselves.

Decomposed, reviewable engineering. Eight PRs across one epic, each with a discrete purpose. The history is readable, the decisions are attributable, and each phase is independently verifiable. This is not the output of a sprint that closed an issue in one commit.

Honest calibration over confident heuristics. The benchmark (PR #116) is not a correctness test — it is a tool for detecting when the scorer drifts toward over-trusting. Building the measurement alongside the feature is an engineering discipline choice, not a bonus.

Fully public and runnable. The repository, the npm package, the CI, and the PR history are all publicly accessible. Every claim in this piece has a link you can click.

Honest framing

deepdive has 2 GitHub stars. "Research agent" is a crowded category — the differentiation is in the engineering quality of the source-trust work, not in novelty of the category itself. The epic is small (eight PRs, a few hundred lines), not a multi-year research project. What it is: a complete, verifiable, benchmark-backed implementation of a real design decision, built and shipped in public.

If you are evaluating autonomous engineering output, the trust engine is a useful signal: a non-trivial design problem, decomposed cleanly, measured, and maintained to the point where a new edge-case bug surfaces, gets caught, and gets fixed — in a single, auditable PR.

All claims are verifiable at github.com/askalf/deepdive and the npm registry (as of 2026-06-26).

We build AI systems that have to be right about what they trust — retrieval, source quality, the difference between cited and correct. If you're putting an agent's output in front of someone who will act on it, that's the kind of problem we go deep on.
