Technical Whitepaper

How It Works & Hardware Guide

A concise primer on the math behind local AI inference, the architectural differences between PC and Mac platforms, and the role of quantization in determining what models your hardware can actually run.

1. The Math: Why VRAM Is the Bottleneck

Local AI inference is fundamentally memory-bound. Before a single token can be generated, every weight of the model must be resident in the GPU's video memory (VRAM). If the model does not fit, it does not run, period. Compute throughput (TFLOPs) is a secondary concern; without sufficient VRAM, no amount of raw processing power will compensate.

VRAM Formula

VRAM_required = (Parameters × Bits) ÷ 8 + 20% buffer

Example, Llama 3 8B at FP16: (8 × 10⁹ parameters × 16 bits) ÷ 8 bits per byte = 16 GB of weights; adding the 20% buffer ≈ 19.2 GB required.

The 20% buffer accounts for activation tensors, the KV-cache (which grows with context length), CUDA kernel workspace, and framework overhead. Production workloads with long contexts or batched requests may require an even larger margin. Insufficient headroom manifests as out-of-memory (OOM) errors the moment context grows beyond a trivial prompt.
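The arithmetic is easy to script. The sketch below is a hypothetical helper (the function name and structure are our own, not part of any inference framework) that applies the formula above, including the 20% buffer:

```python
def vram_required_gb(params: float, bits: int, buffer: float = 0.20) -> float:
    """Estimated VRAM requirement in gigabytes (weights plus overhead buffer)."""
    weight_bytes = params * bits / 8           # bits -> bytes of weight storage
    return weight_bytes * (1 + buffer) / 1e9   # add buffer, convert bytes -> GB

print(f"{vram_required_gb(8e9, 16):.1f} GB")   # Llama 3 8B at FP16 -> 19.2 GB
print(f"{vram_required_gb(8e9, 4):.1f} GB")    # same model at 4-bit -> 4.8 GB
```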

2. Architectural Comparison: Windows/PC (NVIDIA) vs Apple Silicon (Mac)

The two dominant consumer platforms for local AI take fundamentally different approaches to memory. NVIDIA discrete GPUs ship with dedicated, high-bandwidth GDDR or HBM memory soldered to the card. Apple Silicon uses a Unified Memory Architecture (UMA) in which the CPU, GPU, and Neural Engine share a single pool of LPDDR5 system RAM.

| Specification | Windows / PC (NVIDIA) | Apple Silicon (Mac) |
| --- | --- | --- |
| Memory Type | Dedicated GDDR6 / GDDR6X / HBM3 | Unified LPDDR5 (shared CPU + GPU) |
| Bandwidth (Speed) | ~500 GB/s (RTX 4070) to 3,350 GB/s (H100) | ~400 GB/s (M3 Max) to 800 GB/s (M2/M3 Ultra) |
| Max VRAM (Consumer) | 32 GB (RTX 5090) | Up to 192 GB (M2 Ultra) / 512 GB (M3 Ultra) |
| Effective AI Memory | 100% of VRAM | ~75% of system RAM usable as unified memory for AI |
| Software Ecosystem | CUDA, cuDNN, TensorRT, vLLM, llama.cpp | MLX, Metal, llama.cpp (Metal backend) |
| Strength | Raw throughput, training, ecosystem maturity | Massive model capacity, power efficiency |

The practical implication: a Mac Studio with 192 GB of unified memory can load a quantized 70B-parameter model that would require multiple datacenter-class GPUs on the PC side โ€” but at roughly one-quarter the memory bandwidth of an H100, resulting in lower tokens-per-second throughput. By default, macOS reserves approximately 25% of system RAM for the OS and CPU workloads; the remaining ~75% is addressable by the GPU as effective VRAM. Power users can raise this ceiling with the iogpu.wired_limit_mb sysctl, but the 75% rule is the safe default.
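As a sketch of how that budget plays out (the helper functions and the 0.75 default are illustrative assumptions based on the rule above, not an Apple API):

```python
def effective_ai_memory_gb(system_ram_gb: float, gpu_fraction: float = 0.75) -> float:
    """Unified memory addressable by the GPU under the default macOS limit."""
    return system_ram_gb * gpu_fraction

def model_fits(params: float, bits: int, system_ram_gb: float) -> bool:
    """True if model weights plus a 20% overhead buffer fit in the GPU-visible pool."""
    model_gb = params * bits / 8 * 1.20 / 1e9
    return model_gb <= effective_ai_memory_gb(system_ram_gb)

print(model_fits(70e9, 4, 192))    # 70B at 4-bit: ~42 GB vs 144 GB usable -> True
print(model_fits(70e9, 16, 192))   # 70B at FP16: ~168 GB vs 144 GB usable -> False
```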

3. Quantization: The Performance vs Memory Trade-off

Quantization reduces the numerical precision used to store each model weight, shrinking the memory footprint at a controlled cost in output quality. Modern quantization schemes (GPTQ, AWQ, GGUF k-quants) preserve the vast majority of model accuracy even at aggressive bit-widths.

| Format | Bit-width | Memory | Quality | Typical use |
| --- | --- | --- | --- | --- |
| FP16 / BF16 | 16-bit | 100% (baseline) | Reference accuracy | Training, research, fine-tuning |
| INT8 / Q8 | 8-bit | ~50% of FP16 | ≥99% of FP16 quality | Production inference, balanced workloads |
| Q4 | 4-bit | ~25% of FP16 | ~97–98% of FP16 quality | Local inference on consumer hardware |

A 70B-parameter model that requires roughly 168 GB at FP16 (including buffer) becomes feasible on a single 48 GB GPU when quantized to 4-bit (~35 GB of weights, ≈42 GB with buffer). For most chat, coding, and reasoning workloads the perceptible quality difference between FP16 and Q4 is negligible, making 4-bit quantization the de facto standard for local deployment.
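Reusing the Section 1 formula, a quick sanity check (a sketch, not a benchmark) reproduces these figures for a 70B model across the three precision tiers:

```python
PARAMS = 70e9  # 70B-parameter model

for label, bits in [("FP16 / BF16", 16), ("INT8 / Q8", 8), ("Q4", 4)]:
    size_gb = PARAMS * bits / 8 * 1.20 / 1e9   # weights + 20% buffer, in GB
    print(f"{label:<12} {size_gb:6.1f} GB")    # 168.0, 84.0, 42.0 GB respectively
```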

All formulas, benchmarks, and recommendations are derived from publicly available technical documentation and are provided for educational purposes. See our Terms for the full disclaimer.

โ† Back to the Matchmaker