Sample Audit Deep-Dive: a ReDoS, found and fixed

Subject: github.com/askalf/deepdive @ d478a23 · TypeScript / Node · ~40 source modules · ~30 test files · Method: static source analysis (the ReDoS was additionally reproduced with a standalone timing harness)

1. Executive Summary

deepdive is in strong shape — a security-conscious, well-tested codebase with the threat model written into its comments, bounded inputs throughout, atomic file writes, and a deliberate, documented effort to avoid ReDoS in its URL handling. The audit found one P2 worth fixing now and four P3 hardening items — no P0/P1.

The P2 is a slightly ironic one: the codebase carefully hand-writes its URL helpers regex-free specifically to dodge ReDoS, but the one place it compiles untrusted input to a regex — the robots.txt matcher — is exactly where a ReDoS slipped in. It's a denial-of-service against your own process (no data exposure), it's externally triggerable on the default path, and the fix is small. We shipped it.

2. Findings

P2✓ Fixed — same dayReDoS via attacker-controlled robots.txt

Before fetching a host, deepdive fetches and respects that host's robots.txt (the polite default). Each Allow/Disallow pattern was compiled to a regex by expanding every * to .* with no bound on the wildcard count. Since the robots.txt is served by the site being researched, that input is attacker-controlled — and .*.*.*… against a non-matching path backtracks polynomially. Reproduced on a 17-character path: 2★ → 0.04ms, 6★ → 2.2ms, 10★ → 218ms — roughly 10× per two extra wildcards, so a robots line with a few dozen legal * takes the per-URL check from microseconds to minutes and hangs the run.

Remediation (shipped, PR #50, CI-green): replaced the regex with a linear two-pointer wildcard matcher — O(path × pattern) worst case, no catastrophic backtracking — preserving the exact */$ semantics, plus a regression test that feeds a 50-wildcard pattern and asserts matching stays linear. Full suite: 433/433.

P3 · hardeningFour hardening items (no live exploit)

robots crawl-delay parsed but never enforced — deepdive honors allow/deny but ignores the host's requested pacing (the field is read and surfaced, then nothing reads it). A stated-politeness gap.
Default search adapter reads the response unbounded — the DuckDuckGo path buffers the full HTML with no size cap, where every other fetch path is byte-bounded. Low risk (trusted endpoint), but inconsistent.
Cache temp-file suffix isn't collision-proof — the atomic-write temp name uses the PID, so two concurrent writes of the same URL could interleave. Latent, not active.
No SSRF guard on fetched URLs — fine for the primary local-CLI use; matters the moment deepdive is run server-side, where a malicious result could reach localhost or a cloud metadata endpoint. Flagged as deployment-conditional.

3. What we did not publish

Two findings an automated pass had flagged were checked against the real code and discarded — included here because not publishing an unproven finding is the discipline you're paying for:

✗ Dropped"Session path traversal" — not a vulnerability here

The save path's id is system-generated, and the load path runs against the user's own ~/.deepdive on a single-user local CLI — the "attacker" would be the user reading their own files. No privilege boundary is crossed, so there is nothing to exploit in this threat model.

✗ Dropped"API key sent in the request body" — correct usage

The Tavily adapter puts api_key in the POST body — which is Tavily's own documented API contract, sent over HTTPS to Tavily's endpoint. No leak; flagging it would have been noise.

4. What's Working — Keep It

Deliberate ReDoS avoidance in the URL helpers — hand-written linear scans specifically to dodge the polynomial regex forms, and documented. The right instinct; the robots matcher was the one spot it wasn't applied (now fixed).
Bounded inputs everywhere — per-fetch byte caps, word caps, PDF page caps, and a deliberately one-level directory walk so --include ~/ can't ingest a thousand files.
Atomic temp-then-rename writes for both sessions and the page cache — no torn reads on crash.
Defensive HTML handling — stray < defused to neutralize partial tags, single-pass entity decoding to avoid double-unescaping, and explicit "this is not a sanitizer" comments where stripping isn't a security control.
Strong test coverage (~30 files including parser edge cases) and CodeQL in CI.

Verification: every published finding was re-extracted from the cited source and confirmed present by a deterministic mechanical gate (5 of 5 passed, 0 fabricated); the ReDoS was additionally reproduced with a live timing harness; two findings that didn't hold against the real code were discarded, not published. Exploit specifics are generalized here (this is a public sample) — a client engagement includes exact reproductions. Static analysis only. Every engagement includes a 60-minute readout call to walk through findings and the roadmap.

We audited a mature codebase, found a real ReDoS — and threw out two findings that didn't survive the code.

1. Executive Summary

2. Findings

3. What we did not publish

4. What's Working — Keep It

Want this for the code you're already running?