Why your agent burns its quota 10x faster: a prompt-cache autopsy

Anthropic's prompt cache keys on a byte prefix. Put one dynamic byte above your static blocks and you re-bill everything below it — and it never throws an error. It just shows up as a bigger bill and a session that hits a wall early.

I run a small fleet of agents through a proxy I wrote, sitting between my tools and the model. For weeks I had a symptom I couldn't explain: long agentic sessions hit a wall through my proxy, but the genuine Claude Code client on the same account sailed for hours. Same plan, same machine, same model. One of them drained the window in a fraction of the time.

The cause turned out to be one of the most expensive footguns in the whole LLM-app design space, and almost nobody talks about it because it never fails loudly. It just costs more.

How the cache actually keys

Anthropic's prompt cache is byte-prefix-based. You mark a few breakpoints in your request, and the server caches the request up to each marked point. On the next call, it walks the prefix and reuses the longest run of bytes that matches exactly — then bills only what comes after as fresh input.

Read that again, because the word doing all the work is exactly. The cache key is the literal byte sequence of everything above the breakpoint. Change one byte high up — a timestamp, a token count, a path that includes today's date — and every cached byte below that change is invalidated. Not corrupted, not errored. Silently re-billed as if you'd never cached at all.

That's the trap. The cache doesn't punish you with an exception. It punishes you with an invoice.

The breakpoint my proxy was missing

Here's what my proxy was doing. It cached the system prompt — one breakpoint, the obvious one — and it stripped every message-level cache_control on the way through. Felt reasonable. The system prompt is the big static block, right?

Incomplete, in two ways that compound.

First, the tools schema. For an agent that's 10–20KB of JSON — tool names, descriptions, parameter shapes — and it is bit-for-bit identical on every single turn. With no breakpoint on it, every turn re-billed the entire tools block as fresh input. Second, and worse: the conversation itself. An agent's whole value is that the transcript grows — turn 4 carries turns 1 through 3. With no rolling breakpoint on the messages, the entire growing history re-billed as fresh input on every turn. The longer the session ran, the more it paid to re-send what it had already sent.

I measured it. Across the fleet, cache-read ran about 1.9%. Real Claude Code, on the same traffic shape, runs 70–90%. That gap is the whole story: nearly all of my input was being treated as new, which means it was draining the Max 5h/7d window roughly 10 to 50 times faster than the genuine client. The "wall" wasn't a rate limit. It was me paying full freight for tokens Anthropic would have given me back for a tenth of the cost.

The fix is to copy the real client exactly

Anthropic gives you four cache breakpoints. The genuine client uses all four, and once you see the layout it's obvious why:

breakpoint 1   system prompt (static identity)
breakpoint 2   system prompt (the rest)
breakpoint 3   the last tool definition
breakpoint 4   a ROLLING marker on the last message

Two on the system blocks, one on the tools, and — the one I'd been stripping — a rolling breakpoint that walks forward to sit on the last message every turn. That last one is what caches the growing conversation. As the transcript extends, the breakpoint moves with it, so each turn's already-sent history is served from cache and only the genuinely new tail is billed fresh.

I wrote a function that mirrors that breakpoint layout exactly — caches the last tool, plus a rolling breakpoint on the last message — and ran it against an identical four-turn conversation, before and after.

                 BEFORE          AFTER
fresh input      1750            12        (-99%)
per-turn fresh   19→18→519→1194  3→3→3→3
session cache-r  ~1.9%           100%

Fresh input dropped from 1750 tokens to 12 across the session — a 99% cut. But the shape is the real point. Before, per-turn fresh input grew: 19, then 18, then 519, then 1194, climbing as the conversation got longer. After, it went flat: 3, 3, 3, 3. The cache no longer cared how long the session ran.

And that flatness is the durable win. The "before" curve grows linearly with session length; the "after" line stays flat. So the savings don't just exist — they widen the longer the agent works. A short session barely shows it. An hours-long agentic run is where the gap becomes the difference between hitting a wall and sailing. The exact symptom I'd started with, root-caused to four breakpoints. The change is public, in my proxy dario.

There's a quieter lesson in there too: not caching the tools and conversation wasn't only a cost bug, it was a fidelity bug. The real client caches them, so a proxy that doesn't is sending a differently-shaped request — one more way to look unlike the genuine thing. Copying the client's caching exactly is both cheaper and less detectable.

The canary that caught a single dynamic byte

The fleet proxy was the obvious version of this bug. The next one was sneakier, and I only caught it because of a tripwire. On a separate system — one of my private platforms — I run a canary that fires a known agent ("the Writer") with identical input on a schedule and watches the cache numbers. A second identical call should be a cache read. Instead it logged two cache creations back to back:

run #1   cache_write 44,351   cache_read 0   $0.0637
run #2   cache_write 44,427   cache_read 0   $0.0567

Two creations, zero reads, on byte-identical input. The proxy bug all over again, just wearing a different coat: it was re-caching roughly 40,000 tokens every run, paying the 1.25x cache-write premium each time and getting none of the 0.10x read discount back. Six cents a run sounds like nothing until it's every scheduled fire, forever.

The root cause was one block in the wrong place. The system prompt was assembled by interleaving static pieces (identity, vision instructions, the base prompt) with dynamic ones — memory context, a per-execution worktree path, and a budget hint built from NOW(). That budget hint changed every single run by definition, and it sat above static blocks. So the first dynamic byte — a timestamp that's different every time — invalidated every static block downstream of it. The 40k of stable prompt below the timestamp was getting re-cached purely because one volatile byte sat in front of it.

The fix isn't to delete the dynamic context — the agent needs it. It's to re-role it. Split the assembly into two zones: a byte-stable system prompt (identity, vision, base, memory instructions — nothing that changes), and a dynamic "execution context" preamble moved down into the user message. The model receives the exact same bytes; they just live below the cached prefix now instead of poisoning it. Warm-run cache-read goes from 0 to ~40k, cache-write from ~44k down to ~4k, and per-run cost drops 60–70%. Same information to the model, a fraction of the bill.

The rule that falls out: anything dynamic goes after everything static. Sort your prompt by volatility. Identity and tools never change — they go up top, behind the breakpoints. Timestamps, budgets, per-run paths — they go at the bottom, ideally in the user turn. Interleave them and you've defeated the cache for everything underneath, and the model won't tell you. The bill will.

The correctness trap that has nothing to do with cost

There's one more failure mode here, subtler than placement, and it bit me on a third system — a personal reimplementation of the client's caching strategy. This one isn't about where you cache. It's about the fact that touching the cache can silently change the thing you're caching.

The Anthropic API lets a message's content be either a plain string or a list of typed blocks. To attach cache_control you need the block form — you can't hang metadata off a bare string. So the tempting move is: take the last message, wrap its string content into a single text block, attach the marker, send it.

That wrap is the bug. Here is the load-bearing distinction:

  • cache_control itself is metadata. It's excluded from the cached-prefix identity — adding it doesn't change the key. Marking a system block is safe, because you mark it the same way every turn, so the bytes render identically each time.
  • But converting a message's content from "hello" to [{"type":"text","text":"hello"}] is a representation change. It alters the actual serialized bytes of that message. And that does change the prefix.

Now combine that with the one thing that's always true of a conversation: which message is last shifts every turn. On turn 1 you wrap message 1 because it's last. On turn 2, message 1 is no longer last — it's history, and you render it as a bare string again. So message 1's bytes are different on turn 2 than they were on turn 1: block form then, string form now. The prefix doesn't match. The cache read silently fails. No error, no warning — just a hit rate that quietly craters and a cost that quietly climbs, for a reason that looks like nothing at all in the logs.

That's why system wrapping is safe and message wrapping is not. A system block is marked identically every turn, so its bytes are stable. A message moves through the "is it last?" boundary, so wrapping it makes its representation oscillate. The safe move is to apply your breakpoints to a copy of the request and never let the wrap touch the messages you keep — mark the rolling breakpoint, send, and leave your real transcript marker-free so the next turn re-renders the same bytes it did before.

What ties these together

Three systems, three versions of one bug. A proxy that cached too little. A scheduler that put a timestamp in the wrong place. A reimplementation that changed bytes while trying to cache them. Different code, same root: the cache keys on an exact byte prefix, and every one of these quietly violated the prefix without anything raising its hand.

None of this throws. That's what makes it worth writing down. A crash gets fixed in an afternoon because it screams. A cache that silently degrades from 90% to 2% just sits there as a line item nobody reads, draining a quota for weeks. The only way I found all three was measurement: a cache-read percentage I actually watched, and a canary that expected a read and screamed when it saw a write.

This is what Own Your Stack means in practice. When you run the inference path yourself instead of renting it by the token, the cost structure is yours to see and yours to fix. I could measure cache-read at 1.9%, trace it to a missing breakpoint, copy the real client's layout, and watch fresh input fall 99% — because the whole path was in my hands. Rent it through someone else's abstraction and that gap is invisible: you just pay it, and never know there was a tenth-the-price version sitting right there.

Measure your cache-read rate. If it isn't up near where the real client runs, something above your static blocks is changing every turn. Go find the dynamic byte.

We build and run software on AI infrastructure that shifts under it — agents, proxies, the cost and reliability work that decides whether a system is viable or just expensive. If you're running models at any scale and the bill or the behavior doesn't add up, that's the kind of thing we're good at tracing to the byte.

Start a conversation →
← All writing