Writing

Research & build notes

Field reports from an independent studio that runs on its own agent fleet. AI infrastructure, the wire-level reality of building on someone else's models, and notes from shipping real software end-to-end. No think-pieces — just what we learned doing the work.

June 26, 2026 · Source trust · 6 min read

deepdive: a source-trust engine for a local research agent

A research agent is only as good as the sources it trusts — and search ranks an anyone-can-publish Google Doc next to a peer-reviewed paper. A measured, two-axis source-authority engine, built as a decomposed P1–P4 epic with a calibration benchmark — down to the real bug that scored a fabricable doc as authoritative, found, measured, and fixed in one auditable PR.

June 26, 2026 · Release engineering · 7 min read

dario: a self-healing autonomous release pipeline

A published package that has to track a dependency Anthropic moves every few days — with no manual bandwidth to do it. The closed-loop pipeline that watches, validates, ships to npm, and heals itself: an idempotency gate, three trigger paths engineered around GitHub's own constraints, and self-monitoring guards. Six verifiable auto-releases in 72 hours, zero humans in the loop.

June 24, 2026 · AI infrastructure · 4 min read

The Gateway Layer Isn't the Stack

“Own your AI stack” isn't a product category — it's a practice. Six receipts from the actual work: a routing proxy that corrupted code in transit and made the model look guilty, a command injection closed at the class level, an 18-minute hang where every timeout had passed, and the four-day silent outage that kept returning 200. Every claim links to a real PR.

June 24, 2026 · AI infrastructure · 4 min read

Why askalf isn't a framework — and what that means when you build with LangGraph, CrewAI, or AutoGen

Every 2026 roundup of open-source AI agent tooling covers LangGraph, CrewAI, and AutoGen — and askalf, the platform this studio runs on, is in none of them. Partly a visibility gap we're fixing; partly correct, because askalf isn't a framework. It ships as a running system — dashboard, ticket queue, a fleet of agents, an LLM proxy with cost caps, a queryable execution log, human-in-the-loop — the operational layer those frameworks could run inside.

June 23, 2026 · Operations · 8 min read

Your embedding model vanished and nothing paged you

The outages that cost you aren't the ones that fall over loudly — they're the ones that keep returning 200 and are quietly wrong for days. A four-day silent outage of the fleet's memory: a model gone after a rebuild, a zero-vector fallback that hid the blast radius, a watchdog guarding the wrong thing, and the guard that turns the next one into a ten-minute alert.

June 19, 2026 · Agent security · 7 min read

The whole agent-security stack, behind one MCP server

Five separate tools — a tool-action firewall, a skill supply-chain gate, a secrets vault, a PII redactor, a browser injection firewall — composed into one MCP server an agent can call mid-task. The bug only stacking reveals, the live proof all five fire, and the dull question that turned out to have a real answer: was the stack actually up to date? It wasn't.

June 18, 2026 · AI engineering · 6 min read

Citation verification isn't enough. My research agent scored a perfect 1.00 on content farms.

Every AI research tool now verifies its citations. Mine gave a flawless score to an answer where every source was an AI content farm — because a verified citation proves the source said it, not that it's true. The orthogonal fix has no model in it.

June 16, 2026 · AI infrastructure · 11 min read

Your CPU isn't bad at LLMs — it's bandwidth-starved

One law predicts every CPU inference number: decode tok/s × model-bytes ≈ a constant 14–18 GB/s memory wall. So I ran an 8B model at 7.6 tok/s and a 30B model larger than the RAM it ran in, on a GPU-less 2013 desktop — and kept the limits no lever fixes: prefill is ~240× off a GPU, speculative decoding is GPU-shaped, and a cheap router can't catch its own confident wrongness.

June 15, 2026 · AI engineering · 9 min read

You can't use an LLM to grade an LLM. I tried, for money.

I sell a $1,500 code audit. The obvious way to scale it is one model finding the bugs and a second checking the first's work. I built that — it doesn't work, and the way it fails is worth more than if it had. A negative result, with the receipts.

June 15, 2026 · Agent security · 8 min read

The agent guardrails that were theater

Controls that existed but never ran: a "jail" that was a shell, an approval gate never wired into the exec path, a denylist that forgot the Windows shell — several found by attacking my own agent. The fix is almost always to remove a capability, not add a sanitizer.

June 15, 2026 · Security · 8 min read

Every untrusted byte is an attacker

AI systems ingest hostile input by design: search results, robots.txt, synced files, a URL a model picked. So classic appsec applies: a ReDoS that blew 0.04 ms up to 218 ms, a cloud-metadata SSRF, an unauthenticated browser takeover, all found by reading the live deployment against the source, including one gap I didn't close.

June 15, 2026 · AI infrastructure · 8 min read

Why your agent burns its quota 10x faster: a prompt-cache autopsy

Anthropic's cache keys on a byte prefix, so one dynamic byte above your static blocks re-bills everything below it — silently, as a bigger bill. The autopsy: fleet cache-read 1.9% versus real Claude Code's 70–90%, the Max window draining 10–50x faster, and the fix that cut fresh input 99%.

June 15, 2026 · AI infrastructure · 7 min read

The bugs your CI will never catch (because it runs clean Linux)

The failures that only surface on real hardware and the real install path: one AUTH_FAILED symptom hiding two unrelated bugs, a regex that returned zero sessions on Windows, a global CLI that was a silent no-op for every npm -g user. Test the published artifact, not the source.

June 15, 2026 · AI engineering · 9 min read

Half my model regressions were the measurement lying to me

A search backend rate-limiting my test IP and silently falling back to Wikipedia, a date tweak that tanked scholarly citations 0.68→0.44, an 18-minute hang where every timeout passed. Half my "model regressions" were the harness lying. Fix the harness before you trust the number.

June 15, 2026 · Agents · 8 min read

How I let an agent run commands on production and still sleep

A four-tier risk engine gated at one chokepoint with automatic rollback, a once-consumed approval queue, autonomy earned by repeated approvals — and the deliberate choice to route a fleet with a deterministic keyword bus instead of an LLM. The architecture for letting an agent touch production without burning it down.

June 15, 2026 · Cryptography · 8 min read

Sealed-sender capacity sharing: blind signatures, and being honest about what they don't hide

I implemented Chaum's 1983 RSA blind signatures from scratch — ~550 lines, 85 assertions — so a trust group can share Claude capacity unlinkably. The part that matters more than the crypto: it protects members from each other, not from Anthropic. Exactly when it's the wrong tool.

June 15, 2026 · Agent security · 6 min read

I built a firewall for my AI agents — what it actually took

OpenClaw became 2026's first big AI-security disaster: one-click RCE, a poisoned skill marketplace, exfil. So I built warden — a firewall that decides what an agent's tools may do before they run. A deterministic gate at 95% recall / 98% precision, an LLM judge that deobfuscates evasion, and a tamper-evident audit. The honest version, misses included.

June 13, 2026 · AI infrastructure · 5 min read

When Anthropic pulled two models overnight — what it takes to keep a proxy honest

On 2026-06-12 a government directive disabled Claude Fable 5 and Mythos 5 for every Anthropic customer. If your tooling keeps advertising a model that no longer answers, it's not neutral — it's broken. Here's how we shipped a fix to dario in a day, and the design rule behind it.
