Your embedding model vanished and nothing paged you.
The outages that cost you aren't the ones that fall over loudly. They're the ones that keep returning 200, keep accepting writes, and are quietly wrong for days. Here's one I found by accident — what hid the damage, how I measured the real blast radius, and the part that actually matters: making the system catch the next one without me.
A service that crashes hard gets noticed in minutes. Alerts fire, a dashboard goes red, a phone buzzes. That outage is annoying but honest — it tells you it's broken. The expensive ones don't. They keep answering, keep looking healthy, and corrupt your data in the background while every “is it up?” check says yes. I found one of those last week. I found it because I asked a maintenance question, not because anything told me to look.
A question, not an alert
The platform box runs its own local inference: a small chat model and an embedding model, both on CPU, both served by a local Ollama. The fleet's memory layer — the thing that lets agents store what happened and recall it later by meaning — calls that embedding model on every write. We moved those embeddings off a hosted API months ago. That was the whole “own your stack” bet in miniature: the inference is mine, the bill went down, and the dependency went away.
The question on the table was unglamorous: should we keep investing in local inference, or let it coast? Before answering from opinion, I ran a read-only health check. That's a habit worth more than any single tool: the cheapest way to be wrong about a system is to reason about it without looking. So I looked.
The thing that was healthy, and the thing that wasn't
The email lead-triage observer that runs on the same box came back green — cron alive, container up, its heartbeat file seventy-six seconds old, mail flowing. Fine. But the drift check, the one that compares the box against the repo that's supposed to be its source of truth, flagged two files. And following that thread is where the afternoon turned.
The drift itself was the good kind of bad: the box was more hardened than the repo — a container had been locked down by hand and never committed back — which meant the next deploy would have silently reverted the hardening, because deploys flow one way, repo to box. I round-tripped the changes and confirmed them byte-for-byte against the live machine before trusting them. A deploy script that quietly undoes your own security fixes is its own small horror. But it wasn't the headline. The headline was three commands deeper.
The embedding model was just… gone
The model list on the box showed the chat model and an idle one. The embedding model the memory layer depends on — the one with a real job — was simply not there. I pulled it back, and the application logs told the rest: the memory layer had been throwing model "…" not found on every embedding call. For about four days. Without a single alert.
The cause was mundane, which is the point. A box rebuild four days earlier had restored the container but not its pulled models. The chat model got re-pulled by the deploy path; the embedding model didn't. Nothing was watching for the difference, so nothing said anything. The component was “up” the entire time. It just couldn't do the one thing it existed to do.
Null wasn't the bug. The zero vector was.
The obvious next question is blast radius. Every memory written during those four days failed to embed — were they lost, or just stored without a vector? I queried for rows with a null embedding, got a reassuringly small number, and felt relieved. The relief was premature, and it was premature for an instructive reason.
Instead of trusting the query, I read the embedding code. On failure, it doesn't store null. It falls back to a zero vector — a thousand-and-twenty-four dimensions of zero. That row looks embedded. It isn't null. It passes every “is the value present?” check you'd think to write. And it is worse than null, because a zero vector doesn't error on read either — it sits in the similarity index matching nothing, or matching everything at a score of zero, depending on the metric. Silent, plausible, wrong: the trifecta.
A null-count had missed most of the damage. The honest query — null or zero-vector, created since the rebuild — found thirty-nine corrupted rows across four memory tiers, twelve of them silent zero-vectors in a single tier that the first pass scored as perfectly fine. (The cheap way to detect them, if you're on pgvector: a zero vector is the only vector whose inner product with itself is zero. One predicate — embedding <#> embedding = 0 — flushes them all out.)
The lesson is older than embeddings. When you measure the damage from a failure, count the failure mode the code actually chose, not the one you assumed it would. A graceful fallback you forgot you wrote is indistinguishable from a landmine.
Repair with the system's own logic, not a guess
Re-embedding thirty-nine rows sounds trivial: read the text, call the model, write the vector. The part that's easy to botch is which text. Embed the wrong field and the row gets a vector that represents something other than what it says — still not null, still wrong, and now hidden from you a second time behind a value that looks legitimate.
So I read how the live path builds the embedding input for each tier and reproduced it exactly: the raw content for one, a composed situation-action-outcome string for another, a trigger pattern for a third, a label-plus-description for the knowledge graph — each truncated at the same 1,200-character cap the production code uses, fed through the same model. Then I re-embedded all thirty-nine and verified that zero null-or-zero rows remained. Faithful, not approximate. A remediation that's close to what the system would have done is just a nicer-looking version of the same corruption.
Owning your inference means owning your monitoring
Here's the part that matters, and it isn't the fix. The fix was an afternoon. The real question is why four days of the fleet's memory failing left no trace anyone would see.
Because the watchdog on that box was watching the wrong thing. It had exactly one job — keep the lead observer's heartbeat fresh — and it did that job flawlessly the entire time. It had no concept that an embedding model existed. The thing it guarded was up; the thing nobody guarded was down for four days. That's the recurring geometry of a silent outage: the monitor and the failure live in different places, and the gap between them is exactly how long the failure gets to run.
So the fix that counts isn't re-pulling a model — that's one command. It's a second guard that asserts the required models are actually present every ten minutes and pages if one goes missing. A vanished model is now a ten-minute alert instead of a four-day archaeology dig. When you rent inference, that kind of monitoring is the vendor's problem and you assume it's handled. When you own the inference, it's yours — and the failure you didn't think to watch for is precisely the one that gets you.
This is the same instinct the rest of the platform already runs on: a fix isn't finished when the symptom is gone, it's finished when the system will catch the next instance by itself. An incident you can't auto-detect a second time is an incident you'll simply have again, just as quietly. Closing that loop is the difference between fixing a bug and removing a class of them from your life.
The over-reach, and the gate that caught me
One honest coda, because the clean version of this story would be a lie. While verifying the repair, I found a large population of knowledge-graph rows with null embeddings — far more than four days could explain. For a moment it looked like a much bigger gap. It wasn't: those were provenance nodes, one per execution, with no description and nothing meaningful to embed. Null by design, correctly excluded from semantic search, not damage.
But chasing that down surfaced something real: my own repair had over-reached. The backfill had embedded three of those provenance nodes — they were null, they fell inside my filter, so they got a vector they were never meant to have. Three rows, trivial in impact, and exactly the class of bug I'd spent the afternoon hunting: a fix that quietly does slightly more than it should.
I went to null them back out — a write to the production database — and the system stopped me. Not a model second-guessing me; a deterministic gate. I'd been doing read-only investigation, and a mutation to shared production state is a different risk tier that requires explicit sign-off. It refused, told me precisely why, and waited. I confirmed; it ran; three rows corrected. That gate firing on my own cleanup is the whole governance argument working in miniature — the boundary between looking and changing is enforced, not vibed, and it doesn't care that the person it's stopping is the one who built it. (I've written before about the risk-tiered architecture that draws that line; this is what it feels like from the inside when it tells you no.)
Why this is the point
None of it was exotic. A model went missing after a rebuild, a fallback hid the damage, a monitor watched the wrong component, and a fix overshot by three rows. Every one of those is ordinary. The discipline is in refusing to treat any of them as a one-time cleanup: round-trip the config so it can't silently revert, count the failure mode the code actually chose instead of the one you assumed, repair with the real logic instead of a plausible guess, and — the one that earns its keep — leave a guard behind so the next occurrence is a ten-minute alert, not a four-day rot.
That's what Own Your Stack costs and what it's worth. Rent your inference and the silent failure modes are abstracted away right up until they aren't, and then they're a support ticket you can't see into. Own it and they're yours to find — but they're also yours to instrument, which is the only reason the second time is a non-event. The capability was a commodity: the model was back in one command. Knowing it had gone, knowing how much it had quietly broken, and making certain I'd know next time without having to look — that was the work.
Honest footnote: this was a human-in-the-loop investigation, not an agent healing itself unattended — I asked the question and approved the production write. The platform's automated detection is the part that's getting better here, and the point of the new guard is that the next occurrence of this exact failure won't need me to be curious that day. The cost framing elsewhere on this site is retail-rate math against a fixed subscription, not literal dollars billed; the shape is real, the dollar sign is a model.
We build and run software on AI infrastructure that shifts under it — local inference, agent memory, the silent failure modes that don't page anyone, and the boring guard that turns the next four-day outage into a ten-minute alert. If you're running models in production and want the failures to be loud, that's the kind of thing we're good at.
Start a conversation →