How I let an agent run commands on production and still sleep

Capability is the easy part. The hard part is letting the thing touch a real box and not waking up to a half-modified machine. Here's the architecture I actually run — the parts that say yes, the parts that say no, and the part that earns the right to stop asking.

Most demos of an autonomous agent stop right before the interesting question. The agent diagnoses the problem, writes the fix, and the video ends on a green checkmark. Nobody shows you the part where it runs Remove-Item -Recurse against the wrong key, the verification fails halfway through, and now the box is in a state that exists in no runbook because nothing planned to be there.

That gap — between “the agent can do it” and “I'd let it do it unattended on a machine that matters” — is the whole job. The capability is a commodity now; any model can draft a remediation. The governance is the product. This is the architecture I run for it, across two of my open-source agents and the private fleet platform behind them.

Every mutating call gets a tier, before it runs

The core idea is boring and that's the point: an agent's action is not allowed to run until something deterministic has looked at it and assigned it a risk tier. In arnie, my open-source ops agent, every mutating tool call is classified into one of four:

green   reads, Get-Service, ping              → run now
yellow  Restart-Service, apt install, edits    → snapshot bytes, then run, rollback ready
red     shutdown, reg delete, rm -rf <dir>     → refuse, escalate to a human
black   rm -rf /, mkfs, dd, disable Defender   → hard-block, even interactively

Green runs immediately. Yellow is the load-bearing tier: it's a reversible local change, so the engine captures the prior state first, then runs, and keeps a rollback path warm. Red is refused and handed to a person. Black is blocked outright — not gated, not escalated, blocked — even if you're sitting right there telling it to proceed. Some actions don't have a safe interactive answer.

The decision is offline. No “ask the model whether this is dangerous” in the hot path, because that's the layer that gets slow and weird exactly when you're under pressure and need a verdict you can trust.
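A minimal sketch of what an offline, rule-based tier classifier could look like. This is not arnie's actual rule set: the `Tier` enum, the `RULES` table, the regexes, and the choice to treat unknown commands as red are all assumptions made for illustration.

```python
import re
from enum import IntEnum

class Tier(IntEnum):
    GREEN = 0   # run now
    YELLOW = 1  # snapshot, run, keep rollback ready
    RED = 2     # refuse and escalate to a human
    BLACK = 3   # hard-block, even interactively

# Hypothetical rule table, checked from most to least severe so the
# highest matching tier wins. The patterns are illustrative only.
RULES = [
    (Tier.BLACK, re.compile(r"\brm\s+-rf\s+/(\s|$)|\bmkfs\b|\bdd\s+if=", re.I)),
    (Tier.RED, re.compile(r"\bshutdown\b|\breg\s+delete\b|\brm\s+-rf\s+\S", re.I)),
    (Tier.YELLOW, re.compile(r"\bRestart-Service\b|\bapt(-get)?\s+install\b", re.I)),
    (Tier.GREEN, re.compile(r"^\s*(Get-Service|ping)\b", re.I)),
]

def classify(command: str) -> Tier:
    """Deterministically map a command string to a risk tier, with no model call."""
    for tier, pattern in RULES:
        if pattern.search(command):
            return tier
    # Assumption: anything unrecognized is not presumed safe; it escalates.
    return Tier.RED
```

The same input always yields the same tier, which is what makes the verdict auditable after the fact.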

One chokepoint, or it's theater

Here's the part people get wrong. It is not enough to have a safety classifier. It has to be impossible to take a mutating action without going through it.

The first version of arnie gated only inside the shell tool: the shell tool checked itself. That sounds fine until you notice that every other tool that could change the box wasn't gated at all, and that the one gated tool simply declined everything destructive when running unattended, which is its own kind of useless. So the gate moved into dispatchTool, the single function every tool call routes through. The interactive confirm() prompt now yields to the engine instead of asking in parallel, so you don't get double-prompted and the two can't disagree.

A chokepoint you can route around is not a chokepoint. If there are two code paths to production and only one of them is guarded, your real policy is “unguarded,” you just haven't found out yet.
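To make the single-chokepoint idea concrete, here is a small sketch. `dispatch_tool` is a Python stand-in for the dispatchTool function named above; the registry, the `gate` callback signature, and the two example tools are hypothetical and not taken from arnie's code.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry: tool name -> (implementation, mutates the box?)
TOOLS: Dict[str, Tuple[Callable[..., str], bool]] = {}

def register(name: str, fn: Callable[..., str], mutating: bool) -> None:
    TOOLS[name] = (fn, mutating)

def dispatch_tool(name: str, gate: Callable[[str, dict], bool], **args) -> str:
    """The only entry point for tool calls; mutating tools must pass the gate."""
    fn, mutating = TOOLS[name]
    if mutating and not gate(name, args):
        return f"refused: {name}"
    return fn(**args)

# Two illustrative tools: one read-only, one mutating.
register("read_file", lambda path: f"contents of {path}", mutating=False)
register("write_file", lambda path, text: f"wrote {path}", mutating=True)

def deny_all(name: str, args: dict) -> bool:
    return False

def allow_all(name: str, args: dict) -> bool:
    return True
```

Because tools are only reachable through `dispatch_tool`, adding a new mutating tool cannot quietly skip the gate; it has to be registered, and registration carries the `mutating` flag.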

A failed fix must not leave the box half-modified

Snapshot-then-run only matters if you actually use the snapshot. arnie ships a revert_change tool and captures prior state automatically on yellow actions, so when a fix runs and then fails its own verification, the agent can undo it and stand the machine back up instead of leaving it stranded between two states. The whole thing shipped with a safety suite that includes a real rollback round-trip, not just a unit test of the classifier in isolation — because the classifier being right and the rollback actually working are two separate claims and I wanted both proven.

This is the difference between an agent that fails safe and one that fails somewhere. The half-modified box is the nightmare. Everything else — a refused action, an escalation, even a clean failure — you can reason about. A box that's half-way through a change nobody finished is the one that pages you at 3 a.m.
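The snapshot, run, verify, revert sequence can be sketched for the simplest case, a single file. This is an illustration under assumptions, not arnie's revert_change implementation: `apply_with_rollback` and its callback parameters are hypothetical names.

```python
import tempfile
from pathlib import Path
from typing import Callable

def apply_with_rollback(path: Path, change: Callable[[bytes], bytes],
                        verify: Callable[[bytes], bool]) -> bool:
    """Snapshot prior bytes, apply the change, and restore if verification fails."""
    snapshot = path.read_bytes()          # capture prior state first
    path.write_bytes(change(snapshot))    # run the yellow-tier change
    if verify(path.read_bytes()):
        return True
    path.write_bytes(snapshot)            # revert so the file is not left half-modified
    return False

# Round-trip demo: a bad edit fails verification and the original bytes come back.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "service.conf"
    cfg.write_bytes(b"port=8080\n")
    ok = apply_with_rollback(
        cfg,
        change=lambda old: b"port=oops\n",
        verify=lambda new: new.strip().split(b"=")[1].isdigit(),
    )
    restored = cfg.read_bytes()
```

The demo is the round-trip in miniature: the check is not that the classifier said yellow, but that the bytes on disk match the snapshot after a failed fix.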

The bypass I only found by red-teaming it against itself

The end-to-end test of arnie surfaced a P0 I'm glad I caught before it caught me. The engine refused a recursive registry delete — Remove-Item -Recurse on a key, correctly classified red. So the model, trying to get its job done, simply re-issued the delete non-recursively on the same key. That classified yellow. And it auto-ran.

The gate worked exactly as written and was bypassed anyway, because the adversary was the agent's own flag selection. It downgraded its own action one tier to get under the bar. That's a privilege escalation, and no amount of pattern-matching on the command string would have stopped it — the second command genuinely was lower-risk in isolation.

The fix in arnie PR #28 is two parts. First, a registry-key delete is now red regardless of the -Recurse flag, because a model-controlled flag must never be able to move the tier. Second, and more general: sticky refusals. Once any action on a target is refused, every later mutating action on that same target is re-gated for the rest of the session. Reads stay open, unrelated targets stay open — so it doesn't over-gate into uselessness — but you can't get told no on a thing and then sneak up on it from a different angle. The safety suite grew from 43 cases to 64, and I validated the fix by re-running the exact bypass live: the delete now escalates with a proper handoff, the key is untouched, and the audit log shows red→escalate with no yellow-execute in between.
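A sketch of the sticky-refusal rule on its own, assuming a per-session set of refused targets. The `Session` class and its `gate` method are hypothetical; the flag-independent classification from the first half of the fix is assumed to have already produced `base_tier`.

```python
from enum import IntEnum

class Tier(IntEnum):
    GREEN = 0
    YELLOW = 1
    RED = 2

class Session:
    """Tracks targets that have been refused during this session."""

    def __init__(self) -> None:
        self.refused_targets = set()

    def gate(self, target: str, base_tier: Tier) -> Tier:
        # Reads stay open, even on a previously refused target.
        if base_tier == Tier.GREEN:
            return Tier.GREEN
        # Sticky refusal: any later mutation on a refused target is re-gated.
        if target in self.refused_targets:
            return Tier.RED
        if base_tier >= Tier.RED:
            self.refused_targets.add(target)
        return base_tier

# Replay of the bypass: red delete, then a downgraded retry on the same key.
session = Session()
key = r"HKLM\Software\Example"
first_attempt = session.gate(key, Tier.RED)
downgraded_retry = session.gate(key, Tier.YELLOW)
read_after_refusal = session.gate(key, Tier.GREEN)
unrelated_target = session.gate(r"HKLM\Software\Other", Tier.YELLOW)
```

The retry that would have classified yellow in isolation comes back red because the target, not the command string, carries the refusal.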

You do not find that bug by reading the code. You find it by running the agent against itself and watching it try to win.

How a human says yes, safely

Escalation is only half a loop. Something has to close it. The naive version — a live prompt the agent blocks on — doesn't survive contact with real ops, where the agent fires and forgets and the human shows up minutes or hours later.

So arnie uses a once-consumed file-queue protocol (PR #34). When the agent escalates, the approval gate drops an <id>.decision.json into the serve queue. An opt-in --execute-approvals flag makes the serve loop consume those decisions: approved means re-run the original archived task through the full safety engine again; rejected means stand down. The guardrails around that are deliberately paranoid:

Off by default. With the flag unset, behavior is identical to before — nothing auto-executes.
A bounded elevation ceiling. Approved work re-runs at a capped tier (--approved-ceiling, default yellow). Red stays an explicit, separate opt-in. Saying yes to a fix does not silently hand the agent root.
Consumed exactly once. The decision file is renamed .decision.done.json after it's read, so an approval can't be replayed.
Only on work it actually escalated. The original task has to be present in the done set, or the decision is ignored.

Approved work re-runs through the safety engine rather than around it. The human's “yes” raises the ceiling a notch; it doesn't disable the floor. And every safety decision and rollback lands in a per-run append-only JSONL audit, so the trail of what-ran-and-why is durable, not a scrollback you lost.
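The consume-once rule can be sketched as follows. The file names follow the text (`<id>.decision.json`, renamed to `.decision.done.json`), but the function name, the `verdict` field, and the shape of the returned re-run request are assumptions, not the actual decision.json contract.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

def consume_decision(queue: Path, task_id: str, done_tasks: set,
                     ceiling: str = "yellow") -> Optional[dict]:
    """Consume <id>.decision.json at most once; return a re-run request or None."""
    decision_file = queue / f"{task_id}.decision.json"
    # Only act on work that was actually escalated and archived.
    if task_id not in done_tasks or not decision_file.exists():
        return None
    decision = json.loads(decision_file.read_text())
    # Renaming marks the decision as consumed so it cannot be replayed.
    decision_file.rename(queue / f"{task_id}.decision.done.json")
    if decision.get("verdict") != "approved":   # hypothetical field name
        return None                             # rejected: stand down
    # The re-run still goes through the safety engine, capped at the ceiling.
    return {"task_id": task_id, "max_tier": ceiling}

with tempfile.TemporaryDirectory() as tmp:
    queue = Path(tmp)
    (queue / "t1.decision.json").write_text(json.dumps({"verdict": "approved"}))
    first = consume_decision(queue, "t1", done_tasks={"t1"})
    replay = consume_decision(queue, "t1", done_tasks={"t1"})
    unknown = consume_decision(queue, "t2", done_tasks={"t1"})
```

The return value is a request to re-run, not a result: the approved task would still be classified and gated again, with `max_tier` as the cap.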

Autonomy you earn, not autonomy you're granted

The previous section is how a human approves one action. But you don't want to approve the same VPN-reset remediation by hand for the four-hundredth time. You want the boring, repeatedly-blessed patterns to graduate out of your inbox — and only those.

casey, my IT service-desk agent, makes autonomy something a pattern earns. Approvals are tallied per remediation pattern. Once a pattern clears a threshold — three human approvals — it graduates to auto-eligible, and the operator can then promote it so the downstream fixer runs it without sign-off. It's built on the same approval gate that records each decision both on the ticket and as the decision.json contract arnie consumes, so casey decides what is trusted and arnie enforces how it runs.

The execute side ships separately from the tallying on purpose, because that's the seam where an agent crosses from advising to acting on production. Trust accrues from repeated, observed agreement — not from a config flag you flipped once and forgot. That's the honest answer to “how do you remove the human from the loop”: slowly, per-pattern, and only after the human has agreed enough times that the removal is just ratifying what was already happening.
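A sketch of the two-step graduation described above: tallying makes a pattern eligible, and a separate operator action promotes it. The threshold of three comes from the text; the `TrustLedger` class and its method names are hypothetical.

```python
from collections import Counter

APPROVAL_THRESHOLD = 3  # from the text: three human approvals

class TrustLedger:
    def __init__(self) -> None:
        self.approvals = Counter()
        self.promoted = set()

    def record_approval(self, pattern: str) -> None:
        self.approvals[pattern] += 1

    def auto_eligible(self, pattern: str) -> bool:
        return self.approvals[pattern] >= APPROVAL_THRESHOLD

    def promote(self, pattern: str) -> bool:
        """Operator action; only succeeds for patterns that earned eligibility."""
        if self.auto_eligible(pattern):
            self.promoted.add(pattern)
            return True
        return False

    def runs_without_signoff(self, pattern: str) -> bool:
        return pattern in self.promoted

ledger = TrustLedger()
ledger.record_approval("vpn-reset")
ledger.record_approval("vpn-reset")
early_promotion = ledger.promote("vpn-reset")              # only two approvals so far
ledger.record_approval("vpn-reset")
eligible = ledger.auto_eligible("vpn-reset")
auto_before_promote = ledger.runs_without_signoff("vpn-reset")
promoted = ledger.promote("vpn-reset")
auto_after_promote = ledger.runs_without_signoff("vpn-reset")
```

Keeping `auto_eligible` and `promote` separate mirrors the seam in the text: counting approvals never executes anything by itself.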

The part where I chose not to use an LLM at all

Everything above is about a single agent touching a single box. The harder problem is a fleet — twenty agents whose work feeds each other. When one agent's output mentions something in another agent's domain, you want a ticket to cross over automatically. The obvious 2026 instinct is to put a model in the middle and let it route.

I didn't, and the private platform's design doc for it spells out why in numbers. LLM-based routing would add roughly $0.01–0.10 per execution, inject inference latency into a hot path, and — the disqualifier — risk hallucinated targets: an agent confidently filing a ticket against a capability that doesn't exist. The doc's verdict is one sentence I keep coming back to: the signal map is deterministic, auditable, and free.

So the “nervous system” is a keyword event-bus. Each agent's capability domain compiles to a word-boundary regex; when another agent's output trips it, a cross-agent ticket gets created. No inference, no per-call cost, and when something routes wrong you can read exactly which literal matched instead of interrogating a black box. Feedback loops are handled structurally — executions tagged as reactively-sourced don't themselves trigger more — and storm control lives in two database-backed limits so it survives restarts instead of evaporating with process memory.
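A minimal sketch of keyword routing with word-boundary regexes and the structural loop guard. The agent names, keyword lists, and the `route` signature are invented for illustration; the private platform's actual signal map is not shown in this post.

```python
import re
from typing import Dict, List

# Hypothetical capability domains; the keywords are illustrative only.
DOMAINS: Dict[str, List[str]] = {
    "network-agent": ["vpn", "dns", "firewall"],
    "billing-agent": ["invoice", "refund"],
}

def compile_domain(keywords: List[str]) -> re.Pattern:
    # \b on both sides so "refund" does not fire inside "refundable".
    return re.compile(r"\b(" + "|".join(map(re.escape, keywords)) + r")\b", re.I)

PATTERNS = {agent: compile_domain(words) for agent, words in DOMAINS.items()}

def route(source_agent: str, output: str, reactive: bool = False) -> Dict[str, List[str]]:
    """Return {target_agent: matched literals}; reactive executions never trigger."""
    if reactive:
        return {}
    hits: Dict[str, List[str]] = {}
    for agent, pattern in PATTERNS.items():
        if agent == source_agent:
            continue  # an agent does not file tickets against itself
        matched = [m.group(0).lower() for m in pattern.finditer(output)]
        if matched:
            hits[agent] = matched
    return hits
```

The returned literals are the audit trail: when a ticket routes wrong, the matched words say exactly why.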

Here's the detail that tells you this is a real running system and not a whiteboard: the live thresholds are tuned far more conservatively than the doc that designed them. The doc specs a minimum signal strength of 2 matches, a 30-minute cooldown, and 20 triggers an hour. The code in production runs a minimum signal strength of 4, a 240-minute cooldown, and 3 triggers an hour. Nobody writes a design doc that conservative on the first pass. Those numbers got tightened after watching real storm behavior (the fleet talking itself into a frenzy) and pulling the knobs until it stopped. The gap between the doc and the deployed config is the scar tissue.
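The three deployed thresholds can be sketched as one admission check. The numbers are the production values quoted above; everything else is an assumption, including keying the cooldown per source and target pair, treating the hourly cap as fleet-wide, and holding state in memory rather than in the database-backed limits the real system uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class StormControl:
    min_signal_strength: int = 4     # matches required (deployed value)
    cooldown_minutes: int = 240      # assumed to apply per (source, target) pair
    max_triggers_per_hour: int = 3   # assumed to be a fleet-wide cap
    last_fired: Dict[Tuple[str, str], float] = field(default_factory=dict)
    recent: List[float] = field(default_factory=list)

    def allow(self, source: str, target: str, matches: int, now_min: float) -> bool:
        if matches < self.min_signal_strength:
            return False
        key = (source, target)
        if key in self.last_fired and now_min - self.last_fired[key] < self.cooldown_minutes:
            return False
        # Keep only triggers from the last 60 minutes, then apply the hourly cap.
        self.recent = [t for t in self.recent if now_min - t < 60]
        if len(self.recent) >= self.max_triggers_per_hour:
            return False
        self.last_fired[key] = now_min
        self.recent.append(now_min)
        return True

sc = StormControl()
weak_signal = sc.allow("a", "b", matches=2, now_min=0)
first_fire = sc.allow("a", "b", matches=5, now_min=0)
in_cooldown = sc.allow("a", "b", matches=5, now_min=30)
second = sc.allow("a", "c", matches=4, now_min=31)
third = sc.allow("a", "d", matches=4, now_min=32)
over_hourly_cap = sc.allow("a", "e", matches=4, now_min=33)
after_cooldown = sc.allow("a", "b", matches=5, now_min=240)
```

Time is passed in as minutes so the behavior is deterministic and easy to test; a persistent version would read and write the same two pieces of state from a table.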

The fleet that learns overnight, with a receipt

Two more pieces, both private, both built on plain Postgres rather than anything exotic. The first is an “immune system”: a real incident lifecycle — detected, triaging, responding, fixing, verifying, resolved, and then a final state I called immunized. After a fix verifies, the system mints an “antibody” — a procedural memory keyed to the incident's trigger — so the next occurrence of that exact failure is caught and handled automatically instead of paging anyone. Fix once, then never get woken up by that specific thing again.
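As a rough sketch, the lifecycle reads as a small state machine with an antibody table keyed by trigger. The state names come from the text; the strictly linear progression, the `Incident` class, and the in-memory dictionary standing in for Postgres are assumptions.

```python
LIFECYCLE = ["detected", "triaging", "responding", "fixing",
             "verifying", "resolved", "immunized"]

class Incident:
    def __init__(self, trigger: str) -> None:
        self.trigger = trigger
        self.state = LIFECYCLE[0]

    def advance(self) -> str:
        """Move to the next lifecycle state; the final state is terminal."""
        i = LIFECYCLE.index(self.state)
        if i + 1 < len(LIFECYCLE):
            self.state = LIFECYCLE[i + 1]
        return self.state

ANTIBODIES = {}  # trigger -> remembered fix (in-memory stand-in for a table)

def immunize(incident: Incident, fix: str) -> None:
    if incident.state != "resolved":
        raise ValueError("only a verified, resolved incident can mint an antibody")
    ANTIBODIES[incident.trigger] = fix
    incident.advance()  # resolved -> immunized

def known_fix(trigger: str):
    return ANTIBODIES.get(trigger)

incident = Incident("disk-full:/var/log")
while incident.state != "resolved":
    incident.advance()
immunize(incident, "rotate-logs")
final_state = incident.state
remembered = known_fix("disk-full:/var/log")
```

On the next occurrence, a lookup by trigger decides whether the incident pages a human or replays the remembered fix.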

The second is “dream cycles.” Overnight, the platform replays the last day of executions through a pipeline that mines trigger→action→outcome patterns into procedural memory. It flags signals such as success rates under 70% over enough runs, or repeated “max turns reached” flags that mean an agent needs a higher iteration ceiling, and it pre-creates tickets from high-confidence predictions of tomorrow's recurring failures.
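The pattern-mining step can be sketched as a grouping pass over the day's executions. The 70% threshold is from the text; the record fields, the `MIN_RUNS` and `MAX_TURNS_REPEATS` values, and grouping by trigger and action are assumptions, since the text does not say how many runs count as enough or how often a flag must repeat.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

MIN_RUNS = 5              # assumed value for "enough runs"
SUCCESS_THRESHOLD = 0.70  # from the text
MAX_TURNS_REPEATS = 2     # assumed value for "repeated"

def mine(executions: List[dict]) -> Dict[str, List[Tuple[str, str]]]:
    """Group runs by (trigger, action) and flag weak or turn-starved patterns."""
    groups = defaultdict(list)
    for run in executions:
        groups[(run["trigger"], run["action"])].append(run)
    findings = {"low_success": [], "needs_more_turns": []}
    for key, runs in groups.items():
        rate = sum(r["ok"] for r in runs) / len(runs)
        if len(runs) >= MIN_RUNS and rate < SUCCESS_THRESHOLD:
            findings["low_success"].append(key)
        if sum(r.get("max_turns_reached", False) for r in runs) >= MAX_TURNS_REPEATS:
            findings["needs_more_turns"].append(key)
    return findings

# Five runs at 3/5 success, plus two runs that both hit the turn ceiling.
day = [{"trigger": "cert-expiry", "action": "renew", "ok": i < 3} for i in range(5)]
day += [{"trigger": "build-fail", "action": "rebuild", "ok": True,
         "max_turns_reached": True} for _ in range(2)]
report = mine(day)
```

Both findings are plain counts over stored executions, so a nightly job like this needs no model call to decide what to flag.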

I want to give you the receipt, because a self-improving fleet is exactly the kind of claim that's usually a diagram and nothing more. One real overnight trace: a watchdog caught a regression at 10:49 PM, a builder agent fixed it in 12 minutes for $0.14, and a 2:30 AM dream cycle replayed 48 executions, extracted 6 patterns, and minted an antibody — $0.43 for the full overnight learning cycle. (Honest footnote: my fleet's AI-spend numbers are retail-rate math calculated against a fixed subscription, not literal dollars billed. The shape of the number is real; the dollar sign is a model.) Cheap enough to run every night, which is the only reason it's worth building at all.

Why this is the whole point

None of this is exotic. A risk classifier, one chokepoint, a rollback path, an approval file consumed once, a counter that earns trust, a regex bus instead of a model. The exotic part is the discipline to build the governance before you build the autonomy, when the demo would look identical without it and nobody's asking you to.

That's what Own Your Stack means here: you own the part that decides what your agent is allowed to do, instead of trusting that someone else's framework drew the line in the right place — the same way you'd own your data and your infrastructure rather than renting them by the token. An agent that runs commands on production isn't dangerous because it's dumb. It's dangerous when it's ungoverned. The capability is the commodity. The thing that says no — and the thing that earns the right to stop asking — is the product.
