Your CPU isn't bad at LLMs — it's bandwidth-starved

"CPUs are terrible at LLMs because nobody bothered to optimize for them." Plausible, and wrong. There's one physical wall behind every number, and once you see it the whole problem reframes into something you can attack — so I did, on a 13-year-old desktop with no GPU, and ran an 8-billion-parameter model faster than you read.

I wanted to know, concretely, how bad a CPU really is at running a language model — and whether "bad" was a law of physics or just a thing people repeat. So I picked the most hostile box I had and measured everything. A 2013 Intel Core i7-4770 (Haswell, four cores), 16 GB of DDR3-1600, and no GPU — the integrated graphics sat unused the whole time. If a model is usable here, it's usable anywhere. Every number below was measured on that machine.

The headline finding is that the CPU is not bad at this for the reason everyone assumes. People have tried extremely hard to optimize CPU inference — llama.cpp, Intel's oneDNN and AMX, ARM's whole stack. Effort isn't the bottleneck. A single physical wall is, and it predicts every measurement I took to within a few percent.

The one law that explains every number

Generating one token from a dense model requires streaming every weight through the CPU exactly once. That's the whole operation, mechanically: read the weights, do a little arithmetic, emit a token, repeat. Which means decode speed isn't set by how fast the CPU can compute — it's set by how fast it can read. One equation falls out, and it held across every model I ran:

decode tok/s  ×  model-bytes  =  achieved bandwidth  ≈  constant (~14–18 GB/s)

Decode is almost 100% memory-bound. The arithmetic units sit idle, waiting on RAM. My DDR3 has a theoretical peak of 25.6 GB/s, and the workload sustained 14–18 of that — up to ~70% of the physical ceiling on the larger models. Three independent things prove it's bandwidth and not anything else:

The quant sweep. Take one model, change only how many bits each weight uses. Qwen2.5-3B at 7.9 → 4.7 → 3.1 bits-per-weight ran at 4.84 → 7.69 → 11.19 tok/s. Identical math, fewer bytes, proportionally faster — exactly what a bandwidth wall predicts.
Two different models, same wall. An 8B ternary model and a 7B 4-bit model landed at 18.01 and 18.05 GB/s — the same number to two decimals. Their 1.84× speed difference was purely bytes moved, nothing else.
Bigger working sets saturate harder. Utilization climbed 9.5 → 14 → 18 GB/s as models grew, because a larger working set keeps more memory requests in flight and hides DDR3 latency. On this same load, hyperthreading helps: 2 threads gave 5.41 tok/s, 4 gave 6.37, 8 gave 7.78 — more in-flight reads, not more compute.

Once you accept that decode is a bandwidth problem, the entire optimization space collapses to one question: how do I make the CPU read fewer bytes per token? There are exactly three levers, and I ran all three on the potato.

Lever one: ternary weights (the strongest one)

If decode speed is set by bytes, then the most direct lever is to make each weight smaller. Quantization is the dominant knob on a CPU, and its most aggressive practical form is ternary — weights restricted to roughly −1, 0, +1, about 1.6 bits each instead of the 16 they were trained in.

I ran a genuine 8-billion-parameter ternary model (the open Llama3-8B-1.58 reproduction). It has more parameters than the 7B 4-bit model, yet ran 1.84× faster — because it's roughly half the bytes, and the law only cares about bytes. The output was coherent. The number that still surprises me:

An 8-billion-parameter LLM, generating coherent text at 7.6 tokens/sec,
on a 13-year-old GPU-less desktop worth about $100.

7.6 tok/s is faster than most people read. There's a subtlety worth keeping, because it's a real result about where the bottleneck lives: the runtime ran this ternary model without the specialized lookup-table kernels that ternary formats can use — and it still hit the same 18 GB/s wall as 4-bit. Those kernels save compute. On a machine where bandwidth is the bottleneck and compute sits idle, saving compute buys you nothing; we were already saturating the thing that mattered. Those kernels pay off on faster RAM, where compute is the wall. Not here.

Lever two: Mixture-of-Experts (read fewer weights, not smaller ones)

Ternary shrinks the bytes-per-weight term. The other term in the law is weights-read-per-token, and a different architecture shrinks that. A Mixture-of-Experts model routes each token through only a few of its many "expert" sub-networks, so it reads a small active slice of its total parameters per step. That decouples capability (total params) from speed (active params) — precisely what a bandwidth-starved CPU wants.

DeepSeek-V2-Lite (16B total, 2.4B active) decoded at 8.19 tok/s — 2.1× faster than the dense 7B, while carrying 2.3× more total parameters. Bigger brain, faster output. And the proof it's really only reading the routed slice: tok/s × file-size came out to 62 GB/s — 2.4× over the 25.6 GB/s physical peak of the RAM. No dense model can exceed its own memory bandwidth; the only way to "beat" the wall is to not read most of the file. The CPU was moving experts, not the whole model.

Lever three: stack both, and run a model bigger than RAM

The levers compose. Take a MoE whose weights are also near-ternary: Qwen3-30B-A3B (30B total, 3B active) at a dynamic ~1.7-bit quant is an 8.42 GB file — almost exactly this box's 8.4 GB of free RAM. It runs. A 30-billion-parameter model, coherent, at 6.08 tok/s, on a GPU-less 2013 desktop.

The detail that makes the point sharper than the parameter count: free RAM held at 2.84 GB during decode. Only about 5.5 GB of the 8.42 GB file ever became resident. The model is larger than the RAM it ran in — because MoE plus memory-mapping only faults in the experts a token actually routes to, and the cold experts stay on disk, never touched. That is the literal answer to "run anything": you don't need to fit the whole model, only the slice you use.

Three levers, one box, same physics underneath:

lever                  model                      decode    shrinks
dense (baseline)       Qwen2.5-7B Q4 (7B)         3.94 t/s  nothing — reads all weights
ternary                Llama3-8B-1.58 (8B)        7.63 t/s  bits per weight
MoE                    DeepSeek-V2-Lite (16B/2.4B) 8.19 t/s  weights read per token
MoE + ~1.7-bit         Qwen3-30B-A3B (30B/3B)     6.08 t/s  both

Same potato: 7B at 3.9 tok/s up to 30B at 6 tok/s — four times the parameters, faster, by feeding the bandwidth-starved CPU progressively less to read.

But is it any good? Fast is not the same as good.

This is the part most "look what runs on a potato" pieces skip, and it's the part that decides whether any of it is useful. Speed means nothing if the answers are wrong. So I put the lever models and two dense baselines through six checkable questions — arithmetic, a word problem, syllogistic logic, a factual trap, multi-step reasoning, instruction-following — and auto-scored them.

model                          speed     score
Qwen2.5-7B (dense)             3.94 t/s   6/6   clean, correct
Qwen2.5-3B (dense)             ~8 t/s     5/6   only missed the multi-step age
Qwen3-30B-A3B ~1.7-bit (MoE)   6.08 t/s   4/6   nailed hard reasoning, rambled on arithmetic
DeepSeek-V2-Lite (MoE)         8.19 t/s   3/6   verbose, fumbled logic
Llama3-8B-1.58 (ternary)       7.63 t/s   2/6   broken — garbled, echoed the prompt

Speed and capability were inversely correlated. The fastest models were the dumbest; the slowest — the dense 7B — was the smartest. "Fast but dumb" is exactly what the exotic levers bought on this hardware, and pretending otherwise would make the whole exercise a magic trick instead of a measurement.

Two honest caveats so this isn't an unfair indictment. The ternary model's collapse is mostly its specific weakness — a 100-billion-token research reproduction — and the prompt-echoing smells like a chat-template mismatch, not ternary-as-a-technique; a production-grade ternary model would fare better. The MoEs' misses were largely verbosity and output truncation, not raw inability. But the practical conclusion stands: the real winner on this box is a well-tuned dense small model. Qwen2.5-3B at 5/6 and ~8 tok/s is the daily-use sweet spot, with the 7B for maximum quality if you'll accept 3.9 tok/s. The exotic levers prove the frontier — a 30B runs at all — while the boring dense models win the day job.

The wall the levers can't move: prefill

Everything above is decode — generating tokens, bandwidth-bound. The other half of inference is prefill: reading your prompt before it answers. Prefill is compute-bound, and it's where the CPU loses worst, because compute is the exact thing this machine doesn't have. The 7B prefills at about 21 tok/s here. On an RTX 4090 it's around 5,000 — a ~240× gap (decode's gap to the same GPU was "only" ~34×). A 1,500-token prompt takes roughly 72 seconds just to read on the 7B.

No quantization or sparsity fixes this. Ternary and MoE shrink bytes; prefill is raw arithmetic throughput, which the levers can't touch. Long prompts are the CPU's hard ceiling, full stop — and it would be dishonest to sell local inference without saying so. Short prompts, conversational use, and routing are fine. Dumping a long document in and waiting is not.

Why speculative decoding doesn't transfer

Speculative decoding — a small draft model proposes K tokens, the big model verifies them in one parallel pass — gives GPUs roughly 3×. I expected it to help here. It gave about 1.1–1.4×, and the prefill number is exactly why.

The verify pass is a mini-prefill: K tokens, one weight-read. Decompose the 7B's per-token cost and the weight-read is ~254 ms while the compute is ~47 ms. A speculative cycle (draft 4, verify 4, ~65% accepted) spends its bandwidth saving on extra compute — the verify-compute and draft overhead eat almost the entire gain. Speculative decoding trades a bandwidth saving for a compute cost, and compute is the one resource this CPU lacks. The clean rule that falls out, and the one I'd bet on for any bandwidth-bound CPU: levers that reduce bytes win (ternary, MoE); levers that spend compute to save bandwidth (speculative) don't. Techniques are shaped by the hardware they were born on, and speculative decoding is GPU-shaped.

The CPU vs GPU gap, decomposed

Qwen-7B-Q4 ran at 4.14 tok/s here versus ~140 tok/s on a 4090 — a ~34× gap. That gap is exactly the achieved-bandwidth ratio: the 4090 sustains ~600 GB/s on this workload, this box sustains ~18. Not compute, not kernels, not "nobody optimized the CPU." Pure memory bandwidth. Which tells you precisely what you can and can't expect:

Same model, raw throughput — no. 30–40× is physics on fixed silicon, and ternary doesn't close the ratio (a GPU is bandwidth-bound too, so it benefits from ternary just as much).
A capable model at usable speed — yes. Ternary drags the CPU's absolute number across the "good enough" line. 7.6 tok/s for an 8B model with no GPU at all is genuinely usable.
On capacity, the CPU wins outright. RAM is cheap and huge. A ternary 70B fits in ~20 GB of RAM (~$50) and runs slowly; the same model needs $20k+ of GPUs just to fit in VRAM. That's the real answer to the GPU tax — not beating the GPU on speed, but making it unnecessary for a large class of work.

The payoff: a mostly-local hybrid

"Make the GPU unnecessary for a large class of work" isn't a slogan once you've measured the seam. The whole study answered two questions: can a CPU run a useful model? (yes) and is it as good as a frontier model? (no — the dense small ones top out at ~5/6). So route on exactly that seam: answer the easy majority locally, and escalate only the genuinely hard queries to a frontier model.

query → router ─easy─▶ Qwen2.5-3B on this box           (free, private, ~3s)
               └─hard─▶ Claude (frontier, via my proxy)  (~15s)

I built it — ~160 lines of dependency-free Python, open source. The local 3B handles facts, rewrites, and simple Q&A — the bulk of real use — and escalates proofs, code, and multi-step logic to Claude through a frontier proxy I run. On an eight-query demo, five stayed local (fast, free, private) and three escalated and came back frontier-quality. And here is where the study paid for itself in a way I didn't plan: wiring up the escalation path surfaced a real bug in my own infrastructure. The proxy was forwarding request parameters — a reasoning-effort hint and an oversized max_tokens — to model variants that don't accept them, so escalations to certain models failed. I traced it, clamped both for pre-effort and low-cap models, and shipped the fix and release (#539, #540) while the benchmark was still running. You don't find that bug by reading docs. You find it by running the thing.

Two results from the hybrid are worth more than the headline, and they're the limits, not the wins:

A cheap router inherits the cheap model's blind spots — even with verification. The obvious fix for a router gating a weak model is to make the router smarter than the tier it gates, so I built a verify-then-escalate version: the local model answers a few times and escalates when it disagrees with itself. It genuinely helps — it caught a power calculation the 3B computes unreliably. But it has a hard floor: self-consistency can't catch confident wrongness. The 3B answered "17⁴ = 6859" unanimously (it's 83,521), and walked into a rate-trap with 3/3 agreement on the wrong answer — confident, consistent, and wrong. A router built on the cheap model's own signals inherits its blind spots. The only escapes are category rules for known-weak domains, or a verifier stronger than the model — which is roughly a frontier call. No free lunch; that's the open problem, stated precisely.
"Local" isn't automatically fast — until you make it. The 3B first rambled ~25 seconds over-explaining a multiplication; a concise prompt cut that to ~3 seconds. The speed in the diagram is earned, not free.

But the shape holds, and it's the real answer to "comparable to a GPU": most of what you ask an LLM is easy, and easy runs free on hardware you already own. You don't beat the GPU — you stop needing it for the 90%, and spend a frontier call only on the 10% that earns it.

Reproduce it

Two notes that cost me real time, in case you're benchmarking on a managed Windows box. First: stock prebuilt llama.cpp binaries exited instantly with zero output and a cryptic status code. The cause wasn't the hardware — Windows Smart App Control was silently blocking the unsigned native DLLs before main() ran (check Microsoft-Windows-CodeIntegrity/Operational if you ever see this). The fix is to use a code-signed runtime: Ollama ships signed binaries that run fine under SAC. Second: the true ternary format that loads on a stock runtime is TQ2_0 with a llama architecture — Microsoft's official BitNet packaging is a different format that needs its own runtime and won't load here.

# 1. Signed runtime (passes Smart App Control where raw llama.cpp does not)
curl.exe -fsSL -o ollama.zip https://ollama.com/download/ollama-windows-amd64.zip
Expand-Archive ollama.zip -DestinationPath C:\llmgap\ollama
$env:OLLAMA_MODELS = "C:\llmgap\ollama\models"
Start-Process C:\llmgap\ollama\ollama.exe -ArgumentList serve -WindowStyle Hidden

# 2. True ternary: mainline TQ2_0 + llama arch => loads on stock Ollama
curl.exe -fsSL -o C:\llmgap\models\llama3-8b-1.58-tq2_0.gguf `
  https://huggingface.co/brunopio/Llama3-8B-1.58-100B-tokens-GGUF/resolve/main/Llama3-8B-1.58-100B-tokens-TQ2_0.gguf
"FROM C:\llmgap\models\llama3-8b-1.58-tq2_0.gguf" | Set-Content C:\llmgap\models\Modelfile
C:\llmgap\ollama\ollama.exe create llama3-tq2 -f C:\llmgap\models\Modelfile

# 3. Benchmark: the API returns exact token counts + nanosecond durations,
#    so decode tok/s = eval_count / eval_duration, and bandwidth = tok/s × file-bytes
$body = @{ model="llama3-tq2"; prompt="Write a detailed essay about the history of computing.";
           stream=$false; options=@{ num_predict=128; num_thread=8; temperature=0 } } | ConvertTo-Json
$r = Invoke-RestMethod http://127.0.0.1:11434/api/generate -Method Post -Body $body
"{0:N2} tok/s" -f ($r.eval_count / ($r.eval_duration/1e9))

The benchmark numbers are receipts because you can re-run them. The "bits/wt" works out to ~2.35 rather than the theoretical 1.58, because only the linear weights are ternary — embeddings, output, and norm layers stay higher-precision — and the GPU figures are representative published llama.cpp numbers, not measured on my hardware. Everything else came off the i7-4770.

And the router is a receipt too: github.com/askalf/hybrid is the whole thing in ~160 lines of stdlib Python. python hybrid.py --demo runs the routed test set with the confident-wrongness trap included; server.py exposes it as an OpenAI-compatible endpoint you can point any client at. Bring your own local model (Ollama) and any OpenAI-compatible frontier key.

Why this is an Own-Your-Stack story

A CPU can't out-run a GPU on the same model — that gap is bandwidth, and bandwidth is physics. But "the CPU is bad at this" is a myth. The CPU is bandwidth-starved, and ternary plus sparsity feed it exactly enough to run real models at real speed on hardware you already own. That's what Own Your Stack means one layer below the slogan: the intelligence you rent by the token, for the easy 90% of what you actually ask, can run on a $100 desktop in your own room — free, private, and yours to measure. The frontier call is worth paying for the 10% that earns it. Renting it for the rest is a tax you only keep paying because you never measured the alternative. I measured it. For a large and growing class of work, the GPU is optional.

Your CPU isn't bad at LLMs — it's bandwidth-starved.