1. Executive Summary
deepdive is in strong shape — a security-conscious, well-tested codebase with the threat model written into its comments, bounded inputs throughout, atomic file writes, and a deliberate, documented effort to avoid ReDoS in its URL handling. The audit found one P2 worth fixing now and four P3 hardening items — no P0/P1.
The P2 is a slightly ironic one: the codebase carefully hand-writes its URL helpers regex-free specifically to dodge ReDoS, but the one place it compiles untrusted input to a regex — the robots.txt matcher — is exactly where a ReDoS slipped in. It's a denial-of-service against your own process (no data exposure), it's externally triggerable on the default path, and the fix is small. We shipped it.
2. Findings
Before fetching a host, deepdive fetches and respects that host's robots.txt (the polite default). Each Allow/Disallow pattern was compiled to a regex by expanding every * to .* with no bound on the wildcard count. Since the robots.txt is served by the site being researched, that input is attacker-controlled — and .*.*.*… against a non-matching path backtracks polynomially. Reproduced on a 17-character path: 2★ → 0.04ms, 6★ → 2.2ms, 10★ → 218ms — roughly 10× per two extra wildcards, so a robots line with a few dozen legal * takes the per-URL check from microseconds to minutes and hangs the run.
Remediation (shipped, PR #50, CI-green): replaced the regex with a linear two-pointer wildcard matcher — O(path × pattern) worst case, no catastrophic backtracking — preserving the exact */$ semantics, plus a regression test that feeds a 50-wildcard pattern and asserts matching stays linear. Full suite: 433/433.
- robots
crawl-delayparsed but never enforced — deepdive honors allow/deny but ignores the host's requested pacing (the field is read and surfaced, then nothing reads it). A stated-politeness gap. - Default search adapter reads the response unbounded — the DuckDuckGo path buffers the full HTML with no size cap, where every other fetch path is byte-bounded. Low risk (trusted endpoint), but inconsistent.
- Cache temp-file suffix isn't collision-proof — the atomic-write temp name uses the PID, so two concurrent writes of the same URL could interleave. Latent, not active.
- No SSRF guard on fetched URLs — fine for the primary local-CLI use; matters the moment deepdive is run server-side, where a malicious result could reach
localhostor a cloud metadata endpoint. Flagged as deployment-conditional.
3. What we did not publish
Two findings an automated pass had flagged were checked against the real code and discarded — included here because not publishing an unproven finding is the discipline you're paying for:
The save path's id is system-generated, and the load path runs against the user's own ~/.deepdive on a single-user local CLI — the "attacker" would be the user reading their own files. No privilege boundary is crossed, so there is nothing to exploit in this threat model.
The Tavily adapter puts api_key in the POST body — which is Tavily's own documented API contract, sent over HTTPS to Tavily's endpoint. No leak; flagging it would have been noise.
4. What's Working — Keep It
- Deliberate ReDoS avoidance in the URL helpers — hand-written linear scans specifically to dodge the polynomial regex forms, and documented. The right instinct; the robots matcher was the one spot it wasn't applied (now fixed).
- Bounded inputs everywhere — per-fetch byte caps, word caps, PDF page caps, and a deliberately one-level directory walk so
--include ~/can't ingest a thousand files. - Atomic temp-then-rename writes for both sessions and the page cache — no torn reads on crash.
- Defensive HTML handling — stray
<defused to neutralize partial tags, single-pass entity decoding to avoid double-unescaping, and explicit "this is not a sanitizer" comments where stripping isn't a security control. - Strong test coverage (~30 files including parser edge cases) and CodeQL in CI.