The bugs your CI will never catch (because it runs clean Linux)
Your test suite spins up a clean Ubuntu container and calls your functions directly. It is green. It is also lying to you about an entire class of failure that lives only on real hardware and the real install path.
I run my tools on the actual box — a fleet of machines that reboot, a Windows desktop, npm-installed CLIs sitting behind a global symlink. Not a pristine container that gets torn down after the assertions pass. The gap between those two worlds is where the most expensive bugs of the last month lived: every one passed CI, and every one was completely broken for a real user.
The common thread is that a clean-Linux test suite invokes your source. It imports the module and calls the function. It never reboots the host between two process lifetimes, never runs on a filesystem that names directories the way Windows does, never installs the package the way an end user does. A whole category of failure is structurally invisible to it. Here are four I hit for real, with the receipts.
One symptom, two unrelated bugs
This is the headliner, because it is the one that taught me to stop trusting a single error string.
I have a fleet agent that connects each device to a bridge over a WebSocket. After any systemctl restart or reboot, devices started failing auth. The bridge slammed the connection shut and the log said exactly one thing, every time:
AUTH_FAILED
One symptom. It turned out to be hiding two completely unrelated bugs, and I only found the second one because the first fix didn't fully explain what I was seeing.
Bug A: an encryption key derived from your network interfaces. The agent encrypted its config at rest with a key that was a SHA-256 over every currently-up, non-loopback MAC address. The intent was “tie the secret to this machine.” The problem is that the filter only excluded loopback — so docker0, veth*, tun0, wg0 all got folded into the key material. Start a container, bring up a VPN, let Wi-Fi randomize its MAC, and the set of interfaces that are “up” changes between one process lifetime and the next. Now the key is different. AES-GCM does the correct thing and fails its auth-tag check, because the ciphertext genuinely was sealed under a different key.
That alone is a reboot-only bug — nothing in a single test process changes the host's interface list mid-run. But the part that turned a recoverable error into a silent disaster was the error handling. loadConfig wrapped the decrypt in a try/catch whose comment said it was there for “old plaintext config files.” What it actually did was return raw — the still-encrypted base64 blob — whenever decryption threw. So the agent kept booting and sent the encrypted ciphertext as its Authorization: Bearer token. The bridge saw garbage and said AUTH_FAILED. A catch-all that can't tell “never encrypted” from “encrypted under a key I no longer have” doesn't recover from the failure; it launders it into a different, less debuggable one.
I proved the instability wasn't transport corruption with a backup-restore experiment: restoring a known-good config file reproduced the exact same AUTH_FAILED after a restart, which only makes sense if decryption itself is per-process unstable (issue #15, PR #18). The fix replaced the MAC-derived key with a random 32-byte keyfile written once with O_EXCL and chmod 600, and a real “please re-pair” error — not silent garbage — for hosts that couldn't migrate.
Bug B: an RFC violation in the WebSocket handshake. Same symptom. Different cause entirely. The agent also passed its API key as a WebSocket subprotocol token in the handshake. RFC 6455 restricts what bytes are legal in that header, and ws@8.20.0 actually enforces it. A base64 key contains =, /, and + — all illegal there — so the moment a key happened to include one of those characters, the client threw SyntaxError: An invalid or duplicated subprotocol and crash-looped on startup before it could even connect (PR #14).
This is the kind of bug that hides for months: it only fires when the random key draws an offending character, so most machines are fine and a few are mysteriously dead. You don't find it by reading the test output. You find it by reading the RFC and the library source to understand why this host crash-loops and that one doesn't. Two root causes, one error string — if I'd stopped at the first fix, the latent crash would have bitten the next key rotation and I'd have been debugging “the bug I already fixed.”
A regex that returned zero sessions on Windows
claude-sync syncs Claude Code session transcripts between machines. To find the sessions, it has to reproduce the exact scheme Claude Code uses to turn a working directory into a directory name. Mine did it with this:
/[/\\:.\s]+/g → replace with "-"
Read the quantifier. The + collapses a run of separators into a single dash. On Linux that mostly doesn't matter. On Windows it is fatal, because a path starts with C:\ — a colon immediately followed by a backslash — and Claude Code writes that as two dashes, one per character. My + collapsed the run and produced one. So every lookup resolved to a directory that does not exist, and the tool reported zero sessions, with no error, for every Windows user.
I didn't assert the fix — I measured it against the real filesystem. For my own working directory:
cwd: C:\Users\masterm1nd\Desktop\sprayberry labs
old: C-Usersmasterm1ndDesktopsprayberry-labs → 0 sessions (dir does not exist)
new: C--Users-masterm1nd-Desktop-sprayberry-labs → 14 sessions (the real dir)
Zero versus fourteen, on the actual disk (PR #7). The correct encoder maps every non-alphanumeric character to a single dash with no run-collapsing, mirroring what Claude Code actually does. No Linux test would ever have caught this, because no Linux path begins with C:\. The bug lives in the shape of the input, and the input only takes that shape on the platform CI doesn't run.
A global CLI that was a no-op for every npm -g user
This one is my favorite, because it was broken for 100% of real users and 0% of the test suite.
deepdive is a research CLI. You npm i -g @askalf/deepdive and run deepdive. Except: after a global install, deepdive --version exited 0 and printed nothing. main() never ran. Meanwhile node dist/cli.js — the exact thing every test and every benchmark invokes — worked perfectly.
The culprit was the standard ESM “am I the entry point?” guard, which compares process.argv[1] to fileURLToPath(import.meta.url). That is correct when you run the file directly. But npm doesn't install a bin by copying it — it installs a symlink: /usr/local/bin/deepdive points at .../dist/cli.js. Under a symlinked invocation, argv[1] is the symlink path while import.meta.url resolves to the real module path. They never match, the guard decides this isn't the entry point, main() is skipped, and the process exits 0 in silence.
It affected every single npm -g install — the primary way anyone runs the thing — and went unnoticed for exactly one reason: the tests run node dist/cli.js directly and never touch the installed bin. I only found it while standing up a container that installs the published package the way a user would (issue #109). The fix is one line — realpathSync() both sides before comparing — plus a regression test that actually symlinks the CLI and spawns it through the link. The test had to be taught to do the thing the real world does.
The UTF-8 BOM that throws JSON.parse
A smaller one, same family. On Windows, plenty of editors save text files with a UTF-8 byte-order mark — three invisible bytes at the front of the file. JSON.parse chokes on them. So in claude-sync, the instant a user hand-edited a config.json in an editor that added a BOM, every subsequent command threw a cryptic parse error. The fix is unglamorous: strip the BOM before parsing, in all three readers, pinned with a regression test (PR #20). You only write a BOM-tolerant reader after a Windows editor has prepended a BOM to a file your Linux CI was never going to produce.
Why none of these were catchable
Line them up and the pattern is identical. The key churned because the host's interface list changed between reboots; the WebSocket crash needed a real random key to draw an illegal character and a real ws version to enforce the RFC; the Windows encoder broke on a path that starts with C:\; the global bin died under a symlink; the BOM came from a Windows editor. Every one is a property of the environment and the install path, not the logic. Your unit tests assert the logic is correct, and the logic was correct. The failure lived in the seam between the artifact you ship and the place it runs — the exact seam your test suite is built to remove so it stays fast and deterministic. That removal is the whole point of CI, and it is exactly why CI can't see these.
So the only fix that works is structural: test the published artifact on the real hardware and the real install path. Install the npm package globally and run it through the symlink. Reboot the box and watch the agent reconnect. Run the tool on the Windows machine, on the disk, and count what comes back. The truth lives there, not in the source tree, and it does not show up until you go and look.
Why this is the whole point
This is what Own Your Stack means in the least glamorous way possible: own the boring parts. The capability — the agent, the sync, the research run — is the easy 20% that demos well. The reboot that breaks decryption, the colon in a Windows path, the symlink npm quietly inserts: that is the 80% that decides whether the thing works for an actual person on an actual machine. Nobody screenshots it, and it is the only part that determines whether you have a tool or a green dashboard.
I'd rather show you the bugs my CI couldn't catch than a passing test suite that promised they weren't there.
We build and run software on AI infrastructure that shifts under it — agents, integrations, the cross-platform reliability work that decides whether a tool survives contact with a real machine. If you're shipping something that has to actually run where your users are, that's the kind of thing we're good at holding steady.
Start a conversation →