Bandwidth, not FLOPS
Why memory bandwidth dictates local LLM throughput more than raw compute, and what that means for hardware choice.
Local LLM inference is memory-bound, not compute-bound. The bottleneck isn't how fast your GPU can multiply matrices — it's how fast it can stream model weights from memory into the cores.
A good rule of thumb:
tokens/sec ≈ (bandwidth_GBs / model_size_GB) × efficiency
Efficiency is roughly 0.5–0.7 for well-optimized runtimes on consumer hardware. So a 7B model at Q4 (~4 GB) on an M2 Pro (200 GB/s) tops out near (200 / 4) × 0.6 ≈ 30 t/s.
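The rule of thumb is a one-liner in code. A minimal sketch (the function name and the 0.6 efficiency default are illustrative, not from any particular runtime):

```python
def estimate_tps(bandwidth_gbs: float, model_size_gb: float,
                 efficiency: float = 0.6) -> float:
    """Rough decode-speed estimate: each generated token streams all
    model weights through memory once, so throughput is capped by
    bandwidth / model size, scaled by a runtime efficiency factor."""
    return (bandwidth_gbs / model_size_gb) * efficiency

# M2 Pro (200 GB/s) running a 7B model at Q4 (~4 GB):
print(estimate_tps(200, 4))  # ≈ 30 t/s
```

Note this only models decode (one token at a time); prompt processing batches many tokens per weight read and is compute-bound, so it follows different rules.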
What this changes
- A high-end CPU with DDR5 (~80 GB/s) will outrun a low-VRAM GPU on a model that can't fit in VRAM — but lose badly on one that does.
- Apple silicon's unified memory wins on fit (you can run a 70B model at low quant), but its bandwidth (roughly 100–800 GB/s depending on chip) is modest vs a high-end discrete GPU. So it fits big models but runs them slowly.
- A 24GB RTX 3090 (936 GB/s) eats most 13B–34B models for breakfast — until they stop fitting.
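The fit-vs-speed trade-off above can be sketched by combining a capacity check with the bandwidth estimate. Device names, capacities, and bandwidths below are ballpark illustrations, not official specs:

```python
def fits_and_speed(model_size_gb: float, mem_capacity_gb: float,
                   bandwidth_gbs: float, efficiency: float = 0.6):
    """Return an estimated tokens/sec if the model fits in fast memory,
    else None: once a model spills out of VRAM, the bandwidth estimate
    for that device no longer applies."""
    if model_size_gb > mem_capacity_gb:
        return None
    return (bandwidth_gbs / model_size_gb) * efficiency

# Illustrative devices: (memory capacity GB, bandwidth GB/s)
devices = {
    "DDR5 desktop": (64, 80),
    "M2 Pro":       (32, 200),
    "RTX 3090":     (24, 936),
}
for name, (cap, bw) in devices.items():
    for model_gb in (4, 20, 40):  # roughly 7B Q4, 34B Q4, 70B Q4
        tps = fits_and_speed(model_gb, cap, bw)
        label = "doesn't fit" if tps is None else f"~{tps:.0f} t/s"
        print(f"{name}: {model_gb} GB model -> {label}")
```

Running this reproduces the pattern in the bullets: the 3090 dominates on anything that fits in 24 GB, while the slower-but-larger memory pools win by default on the 40 GB model.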
The Hardware Checker uses this formula directly. Pick a device, see what fits, and read the speed estimate next to memory headroom.