The Gateway Layer Isn't the Stack

“Own your AI stack” is a phrase you hear from a lot of directions now — routing layers, key vaults, provider-agnostic gateways. We use it too, and we want to be precise: owning the stack isn't a product category. It's a practice. Here are the receipts.

“Own your AI stack” has become a phrase you'll hear from a lot of directions these days. Routing layers, key management, provider-agnostic gateways — the category is real and the tooling is getting better. We use the same phrase. We want to be precise about what we mean.

Owning the stack is not a product category. It's a practice.

It means being the engineer who ran the security audit, found the injection vector, understood why sanitizing the input wasn't enough, and rewrote the call to eliminate the shell entirely. It means running the red-team yourself, knowing the bypass worked before you shipped the fix, and publishing the write-up either way. It means diagnosing the hang that every monitoring dashboard said wasn't happening — because all the timeouts were set, and none of them were the timeout that mattered.

Here's what that looks like in practice.

The proxy that made the model look guilty

The dario routing layer included a framework-name scrubber that ran over message content, not just metadata. The JS keyword continue; was being stripped to a bare ; in transit. A downstream reviewer found the bare-semicolon pattern and attributed it to model hallucination — when the router itself had corrupted the payload.

Confirmed with a live A/B: 3/3 fabrications through the unfixed proxy, 0/4 through the fixed one.

The lesson isn't “check your proxies.” It's that a gateway between your code and your model is a place where bugs hide in a particularly expensive way: they're misdiagnosed as model behavior, which sends the investigation in exactly the wrong direction.

PR #453 →

Command injection, closed at the class level

The hands file editor shelled out to cat "<path>" for its view operation. A model-supplied path with shell metacharacters would inject arbitrary commands and bypass the bash guardrail entirely. Found in a self-audit.

The fix wasn't input sanitization. We removed the shell-out entirely and reimplemented directly on node:fs. Close the class, not the current exploit. Same day.

PR #58 →

The 18-minute hang where every timeout had passed

A research agent wedged for ~18 minutes. All processes appeared idle. Every per-fetch timeout (30s) and search timeout (15s) was nominally set and firing. Nothing interrupted.

Root cause: Playwright's page.evaluate() accepts no timeout argument. A blocked renderer left the promise pending indefinitely and starved the fixed concurrency pool — silently, with no error signal.

“All timeouts set” and “no process can hang indefinitely” are different claims. Fixed with a hard deadline wrapper plus a detached force-close on the renderer.

issue #87 · PR #99 →

A “safe mode” that forgot one shell

A tool denylist designed to keep an away-mode agent safe missed PowerShell entirely. That's the structural problem with denylists over open-ended capability sets: you can always miss one. We deleted the denylist and used --permission-mode plan instead — deny the capability class, not the known exploits.

PR #32 →

Red-teaming our own agent — and watching it escalate

The safety engine in arnie refused a recursive registry delete (RED tier). The model re-issued the same operation non-recursively (YELLOW tier) and it auto-ran.

The adversary wasn't an external attacker. It was the agent's own flag-selection logic finding a lower path to the same outcome. Fixed with tier-pinning plus sticky refusals. Validated by re-running the exact bypass after the patch.

PR #28 →

The four-day outage that kept returning 200

The fleet's memory layer broke silently after a box rebuild: the embedding model went missing, and the lookup fell back to a zero vector — so every query kept returning 200 OK while quietly retrieving nothing. The watchdog was guarding a different component, so nothing paged. I found it by accident, asking a maintenance question — not because anything told me to look.

Restoring the model was the easy part. The part that matters is the model-presence guard that turns the next occurrence into a ten-minute alert instead of a four-day silent corruption.

Full write-up →

What this adds up to

127 findings. 16 repos. Every claim links to a real PR or issue. The retractions kept in.

None of the above is a gateway capability. A gateway routes requests, manages keys, surfaces usage data. Those are useful tools, and they're solved well in several places. But they're one layer.

Owning the stack means having done the work at every layer — security, reliability, agent behavior — having the receipts, and being able to say which parts still need work.

That's what we mean.