Bandwidth, not FLOPS
Why memory bandwidth dictates local LLM throughput more than raw compute, and what that means for hardware choice.
Local LLM inference is memory-bound, not compute-bound. The bottleneck isn't how fast your GPU can multiply matrices — it's how fast it can stream model weights from memory into the cores.
A good rule of thumb:
tokens/sec ≈ (bandwidth_GBs / model_size_GB) × efficiency
Efficiency is roughly 0.5–0.7 for well-optimized runtimes on consumer hardware. So a 7B model at Q4 (~4 GB) on an M2 Pro (200 GB/s) tops out near (200 / 4) × 0.6 ≈ 30 t/s.
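The rule of thumb is a one-liner in code. A minimal sketch (the function name and the 0.6 efficiency default are illustrative, not from any particular runtime):

```python
def estimate_tps(bandwidth_gbs: float, model_size_gb: float,
                 efficiency: float = 0.6) -> float:
    """Rough decode-speed estimate: each generated token streams all
    model weights through memory once, so throughput is capped by
    bandwidth / model size, scaled by a runtime efficiency factor."""
    return (bandwidth_gbs / model_size_gb) * efficiency

# M2 Pro (200 GB/s) running a 7B model at Q4 (~4 GB):
print(estimate_tps(200, 4))  # ≈ 30 t/s
```

Note this only models decode (one token at a time); prompt processing batches many tokens per weight read and is compute-bound, so it follows different rules.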
What this changes
- A high-end CPU with DDR5 (~80 GB/s) will outrun a low-VRAM GPU on a model that can't fit in VRAM — but lose badly on one that does.
- Apple silicon's unified memory wins on fit (you can run a 70B model at low quant), but its bandwidth (roughly 100–800 GB/s depending on chip) is modest vs a high-end discrete GPU. So it fits big models but runs them slowly.
- A 24GB RTX 3090 (936 GB/s) eats most 13B–34B models for breakfast — until they stop fitting.
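The fit-vs-speed trade-off above can be sketched by combining a capacity check with the bandwidth estimate. Device names, capacities, and bandwidths below are ballpark illustrations, not official specs:

```python
def fits_and_speed(model_size_gb: float, mem_capacity_gb: float,
                   bandwidth_gbs: float, efficiency: float = 0.6):
    """Return an estimated tokens/sec if the model fits in fast memory,
    else None: once a model spills out of VRAM, the bandwidth estimate
    for that device no longer applies."""
    if model_size_gb > mem_capacity_gb:
        return None
    return (bandwidth_gbs / model_size_gb) * efficiency

# Illustrative devices: (memory capacity GB, bandwidth GB/s)
devices = {
    "DDR5 desktop": (64, 80),
    "M2 Pro":       (32, 200),
    "RTX 3090":     (24, 936),
}
for name, (cap, bw) in devices.items():
    for model_gb in (4, 20, 40):  # roughly 7B Q4, 34B Q4, 70B Q4
        tps = fits_and_speed(model_gb, cap, bw)
        label = "doesn't fit" if tps is None else f"~{tps:.0f} t/s"
        print(f"{name}: {model_gb} GB model -> {label}")
```

Running this reproduces the pattern in the bullets: the 3090 dominates on anything that fits in 24 GB, while the slower-but-larger memory pools win by default on the 40 GB model.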
The Hardware Checker uses this formula directly. Pick a device, see what fits, and read the speed estimate next to memory headroom.