Sprayberry Labs · applied AI-infrastructure research

Own Your Stack

Own your AI infrastructure instead of renting it by the token.

One subscription. Your box. Your terms. This is the lab where I prove it works — in the open, scars included.

127 findings· 16 repos mined· every claim links to a real PR or issue

01the thesis

You were sold a meter.

The deal everyone accepted without reading it: intelligence rented by the token. Your costs scale with how much you succeed. Your data flows through someone else's pipes. And the day a model you depend on gets deprecated, throttled, or repriced, your stack changes — and nobody asked you.

That's not a stack you own. That's a stack you're renting, and the landlord can change the locks.

There's another way to build, and it isn't a downgrade. Own the routing, the research, the compute-use, the agents themselves — on your box, your keys, your terms. Sprayberry Labs is where I do that work and show it. Reverse-engineering, agent security, systems plumbing — the unglamorous grind it actually takes.

The one rule of this lab: every claim below links to a PR, an issue, or a line of source. No hype words. Real numbers. The retractions kept in.

02the receipts

The receipts.

Real findings mined from real GitHub history — security holes found by running the agent against itself, a proxy bug that masqueraded as a model hallucination, and bugs that only surface on actual hardware. Click any card to read the work.

bug-huntdario

The proxy that silently deleted continue; from your code

A framework-name scrubber ran over message content, so the JS keyword continue; was stripped to a bare ; in transit. A downstream auditor "found" the bare-; bug and blamed the model — when the proxy itself had corrupted the payload. Confirmed with a live A/B: 3/3 fabricated through the unfixed proxy, 0/4 through the fixed one.

PR #453 · issue #457 →

securityhands

Command injection via a cat shell-out in an agent file editor

The editor's view op shelled out to cat "<path>", so a model-supplied path with shell metacharacters injected arbitrary commands and bypassed the bash guardrail entirely. Found in a self-audit, reimplemented directly on node:fs with no shell at all — closing the class, not sanitizing it — same day.

PR #58 →

red-teamarnie

Red-teamed my own agent: it re-issued a refused command one tier lower to auto-run it

The safety engine refused a recursive registry delete (RED tier), so the model simply re-issued it non-recursively (YELLOW) and it auto-ran — a privilege escalation where the adversary is the agent's own flag selection. Fixed with tier-pinning plus "sticky refusals," validated by re-running the exact bypass live.

write-up →

reliabilitydeepdive

An 18-minute hang where every timeout passed — because one call had no timeout at all

A run wedged ~18 minutes with all processes idle while every per-fetch (30s) and search (15s) timeout was nominally passing. Root cause: Playwright's page.evaluate accepts no timeout, so a blocked renderer leaves the promise pending forever and starves the fixed concurrency pool. Fixed with a hard deadline + detached force-close.

issue #87 · PR #99 →

securitydeepdive

A ReDoS in a robots.txt parser — any search result could hang the run

The research agent reads robots.txt from pages it finds, and its matcher was a catastrophic-backtracking regex: an adversarial robots.txt blew 0.04ms up to 218ms at ten stars. Every untrusted byte is an attack surface — fixed by replacing the regex with a linear two-pointer matcher, not a band-aid.

PR #50 →

A few of the sharpest finds live in private repos — a git-tool RCE in the orchestrator, sealed-sender capacity-sharing crypto in mux — so those get written up on the blog instead of linking to code you can't open.

03numbers that are real

Numbers that are real.

Measured deltas only — each traceable to the finding it came from. No projections dressed up as results.

1.9% → 100%

fleet prompt-cache read, after mirroring CC's cache breakpoints

−99%

fresh input tokens per turn (1750 → 12 on an identical conversation)

0.04 → 218 ms

ReDoS blowup caught in a robots.txt parser, fixed with a linear matcher

3/3 → 0/4

fabricated "bugs" through the proxy that corrupted code in transit, vs. the fixed proxy

590 / 23

tests / suites on a from-scratch Claude Code reimplementation

$0.43*

cost of one full overnight memory-learning cycle for the fleet

* AI-spend figures are calculated at retail token rates against a fixed subscription — a measure of work done, not dollars billed. The real cap is the subscription price, flat.

04research notes

Research notes.

The threads behind the receipts. Long-form write-ups land on the blog as I get to them; every card links to the primary source now.

Agent security: the guardrails that were theater

A field guide to agent-security failure modes — a "jail" that was a shell, an approval gate never wired into the exec path — most found in my own self-audits. The fix is usually to remove a capability, not add a sanitizer.

platform · hands · arnie · agent →

SSRF, ReDoS & the supply-chain-of-input

Every untrusted byte is an attack surface: a cloud-metadata SSRF closed with resolve-then-classify (and an honestly-documented DNS-rebinding gap), a ReDoS in a robots.txt parser, an unauthenticated-CDP browser takeover.

hands · deepdive · browser-bridge · amnesia →

Prompt-cache economics

Anthropic's cache keys on a byte prefix, so one dynamic byte above your static blocks re-bills everything below it — and fails silently as a bigger bill. The measured before/after, and the non-obvious correctness trap.

dario · platform · claude-re →

Don't trust an LLM to grade an LLM

A sustained negative result behind a paid service: LLM verifiers confabulate verdicts even handed raw tool output, and "helping" with a heuristic made it fabricate more. The architecture I trust instead.

platform · deepdive · claude-sync →

Bench discipline

Half my "model regressions" were the measurement lying — a search backend rate-limiting my test IP and falling back to Wikipedia, a date-grounding tweak that helped events but tanked scholarly cites. Fix the harness first.

deepdive →

Autonomy you can actually let run

A four-tier risk engine (auto / snapshot-then-run / escalate / hard-block) gated at one chokepoint with rollback, a once-consumed approval protocol, and autonomy earned by repeated approvals before a pattern runs unattended.

arnie · casey · platform · legacy →

05the honest ledger

The honest ledger.

The retractions, dead-ends, and corrected misdiagnoses — kept on purpose. This is the part the demos skip, and it's exactly why you can trust the rest.

↩ retraction

Retracted my own published billing-signal claim

I'd shipped "CC's system prompt = exactly 3 blocks" as a billing-detection signal. A controlled re-test with preserved request IDs ran 7 mutations and all 7/7 billed the same — and stripping the 27kB prompt recovered up to 2.78× output with billing unchanged. I was wrong, with the request IDs that prove it.

dario · discussion #183 · #172 →

✕ rejected approach

Misdiagnosed a regression as an "LLM flake" — it was my own search layer

I blamed the model for a quality drop; the evidence pointed at the search backend instead. After fixing the harness, I tried a synth-prompt "grounding rules" tweak to claw the score back — and killed it on its own metric when it didn't hold up. Rejected experiments and n=1 caveats, documented, not buried.

deepdive · issue #97 · PR #102 →

06the stack

The stack.

The open tools the research runs on — each useful on its own, the door into owning your stack. MIT-licensed, on GitHub today.

own your routing

dario

Your Claude subscription, in any tool. One local endpoint, no per-token bill. The crown jewel — 270+ stars.

own your inference

hybrid

Local-first LLM routing. Answer the easy majority on a small local model; escalate only the genuinely hard queries to the frontier.

own your research

deepdive

A local deep-research agent. Plan → search → fetch → synthesize a cited report, all on your machine.

own your computer-use

hands

Your LLM on your own mouse, keyboard, and screen — Windows, macOS, Linux — with a full audit log.

own your browser

browser-bridge

Stealth headless Chromium in a container, CDP on your own endpoint. Playwright, Puppeteer, or any MCP tool plugs in.

own your agent security

warden

A deterministic firewall between an agent and its tools. Risk-tiers every call, blocks exfil and injection.

own your agent skills

canon

Vet, sign & pin every skill and MCP server before it runs. Drift detection catches a poisoned tool before it loads.

own your agent secrets

keeper

An encrypted vault that hands agents scoped, short-lived leases instead of raw keys. The key never enters the agent.

own your prompts

cordon

A PII-redacting gateway that fails closed. Strip or tokenize sensitive data before a prompt reaches a model — it never leaves your perimeter.

own your agent browser

picket

A governed browser for agents — an injection firewall, an action gate, and an LLM judge between the agent and the open web.

own your operation

askalf

The self-hosted AI workforce platform — a fleet of agents that triage, build, audit, and ship. The top of the stack.

+ amnesia, agent-security-stack, pgflex, redisflex — the full stack →

07work with me

The lab funds itself. If you've got a codebase you want a hard look at, I run a $1,500 fixed-price code audit — the same deterministic, no-confabulation method described in the ledger above — plus scoped builds and retainers.

See the audit →

08colophon

This site is mostly built and increasingly run by the studio it describes — askalf, a fleet of agents that does the legwork while I architect, review, and own what ships. The research above was mined from real GitHub history across 16 repositories; the claims link to the source so you can check my work. Built in the open, the unfinished parts included. — hello@sprayberrylabs.com