1. The Math: Why VRAM Is the Bottleneck
Local AI inference is fundamentally memory-bound. Before a single token can be generated, every weight of the model must be resident in the GPU's video memory (VRAM). If the model does not fit, it does not run, period. Compute throughput (TFLOPs) is a secondary concern; without sufficient VRAM, no amount of raw processing power will compensate.
VRAM Formula
VRAM_required = (Parameters × Bits per weight) ÷ 8 + 20% buffer
Example, Llama 3 8B at FP16: (8 × 10⁹ parameters × 16 bits) ÷ 8 bits per byte = 16 GB; with the 20% buffer, roughly 19.2 GB required.
The 20% buffer accounts for activation tensors, the KV-cache (which grows with context length), CUDA kernel workspace, and framework overhead. Production workloads with long contexts or batched requests may require an even larger margin. Insufficient headroom manifests as out-of-memory (OOM) errors the moment context grows beyond a trivial prompt.
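To make the arithmetic easy to repeat for other models and precisions, here is a minimal Python sketch of the formula above. The function name vram_required_gb and its parameters are illustrative, not part of any library, and the sketch assumes the same flat 20% buffer described in this section (real overhead varies with context length and batching).

def vram_required_gb(params_billion: float, bits_per_weight: int, buffer: float = 0.20) -> float:
    """Estimate VRAM (in GB, decimal) needed to hold a model's weights.

    params_billion  -- parameter count in billions (e.g. 8 for Llama 3 8B)
    bits_per_weight -- precision of the weights (16 for FP16, 8 for INT8)
    buffer          -- fractional headroom for KV-cache, activations, and
                       framework overhead (20% per the formula above)
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8  # bits -> bytes
    return weight_bytes * (1 + buffer) / 1e9                   # bytes -> GB


if __name__ == "__main__":
    # Llama 3 8B at FP16 reproduces the 19.2 GB figure worked out above.
    print(f"Llama 3 8B @ FP16: {vram_required_gb(8, 16):.1f} GB")
    # The same weights at 8 bits halve the footprint: about 9.6 GB.
    print(f"Llama 3 8B @ INT8: {vram_required_gb(8, 8):.1f} GB")

Running this prints 19.2 GB for FP16 and 9.6 GB for INT8, which is the quickest way to sanity-check whether a given model will fit on your card before downloading it.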