πŸ—„οΈ KV Cache & Inference Optimization

Why autoregressive generation is expensive β€” and how caching makes it practical

Autoregressive Generation

Each new token requires attending to ALL previous tokens. Without caching, we recompute K and V for every token at every step β€” the red cells show wasted work.

Computation per step (each column = one generation step):
Recomputed (wasted without cache) Cached (skipped) New computation
0Ops without cache
0Ops with cache
0%Savings

KV Cache Memory

KV cache stores Key and Value tensors for all previous tokens across all layers and heads. Memory = layers Γ— heads Γ— seq_len Γ— head_dim Γ— 2 (K+V) Γ— bytes.

1285122K8K32K128K
KV Cache: 0

Sequence Length Impact

Attention is O(nΒ²) in sequence length, but KV cache grows linearly. Drag the slider to see how costs scale.

Attention compute (O(nΒ²)) KV cache memory (O(n)) Current position

PagedAttention (vLLM)

Like virtual memory for KV cache. Logical blocks map to physical GPU memory pages, enabling efficient memory sharing across requests.

Logical Blocks (per request)

β†’

Physical GPU Memory