ποΈ KV Cache & Inference Optimization
Why autoregressive generation is expensive β and how caching makes it practical
Autoregressive Generation
Each new token requires attending to ALL previous tokens. Without caching, we recompute K and V for every token at every step β the red cells show wasted work.
Computation per step (each column = one generation step):
Recomputed (wasted without cache) Cached (skipped) New computation
0Ops without cache
0Ops with cache
0%Savings
KV Cache Memory
KV cache stores Key and Value tensors for all previous tokens across all layers and heads. Memory = layers Γ heads Γ seq_len Γ head_dim Γ 2 (K+V) Γ bytes.
KV Cache: 0
Sequence Length Impact
Attention is O(nΒ²) in sequence length, but KV cache grows linearly. Drag the slider to see how costs scale.
Attention compute (O(nΒ²)) KV cache memory (O(n)) Current position
PagedAttention (vLLM)
Like virtual memory for KV cache. Logical blocks map to physical GPU memory pages, enabling efficient memory sharing across requests.
Logical Blocks (per request)
β