🗄️ KV Cache & Inference Optimization

Why autoregressive generation is expensive — and how caching makes it practical

Autoregressive Generation

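Each new token must attend to all previous tokens. Without caching, the model recomputes K and V for every earlier token at every step, even though with causal attention those values do not change once computed. The red cells in the visualization mark this wasted work.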
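To make the wasted work concrete, here is a minimal counting sketch. It is a simplified model, not a measurement: one "op" is assumed to mean projecting K and V for a single token in a single layer, and the function name `kv_projections` and the prompt and generation lengths are illustrative choices.

```python
def kv_projections(prompt_len: int, new_tokens: int, use_cache: bool) -> int:
    """Count K/V projections in one layer (one unit = K and V for one token)."""
    total = 0
    for step in range(new_tokens):
        context_len = prompt_len + step  # tokens visible at this step
        if use_cache:
            # The first step projects the whole prompt once;
            # every later step projects only the newest token.
            total += context_len if step == 0 else 1
        else:
            # Every step re-projects the entire context from scratch.
            total += context_len
    return total


without = kv_projections(prompt_len=4, new_tokens=8, use_cache=False)
with_cache = kv_projections(prompt_len=4, new_tokens=8, use_cache=True)
savings = 100 * (1 - with_cache / without)
print(f"ops without cache: {without}")
print(f"ops with cache:    {with_cache}")
print(f"savings:           {savings:.1f}%")
```

Under these assumptions the uncached count grows quadratically with the number of generated tokens, while the cached count grows by one per step, so the savings increase as generation gets longer.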

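[Interactive figure: computation per step, with the KV cache toggled on or off. Each column is one generation step. Cells are marked as recomputed (wasted without cache), cached (skipped), or new computation. Counters report ops without cache, ops with cache, and percent savings.]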

KV Cache Memory

The KV cache stores the Key and Value tensors for all previous tokens across all layers and heads. Memory per sequence = layers × heads × seq_len × head_dim × 2 (K+V) × bytes per element.
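The formula can be checked with a few lines of arithmetic. The sketch below is a direct transcription of it. The configuration (32 layers, 32 heads, head dim 128, 16-bit precision) is an assumed example, not a value taken from this page, and the `batch` parameter is an added convenience for serving several sequences at once.

```python
def kv_cache_bytes(layers: int, heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int, batch: int = 1) -> int:
    """KV cache size in bytes: the factor 2 accounts for K and V."""
    return layers * heads * seq_len * head_dim * 2 * bytes_per_elem * batch


# Assumed example configuration with 16-bit (2-byte) elements.
size = kv_cache_bytes(layers=32, heads=32, head_dim=128,
                      seq_len=2048, bytes_per_elem=2)
print(f"{size / 2**30:.2f} GiB")  # 1.00 GiB

# The same configuration at 128K tokens needs 64x the memory.
long_size = kv_cache_bytes(32, 32, 128, 131072, 2)
print(f"{long_size / 2**30:.2f} GiB")  # 64.00 GiB
```

Because every factor enters multiplicatively, halving the precision or the sequence length halves the cache.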

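[Interactive calculator: inputs are Layers, Heads, Head Dim, Precision, and Sequence Length (slider with ticks at 128, 512, 2K, 8K, 32K, 128K; default 2048). Output is the KV cache size.]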

Sequence Length Impact

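Attention compute over a full sequence is O(n²) in the sequence length n, while the KV cache grows only linearly, O(n). Long contexts are therefore limited by compute and by memory, but the two costs scale at very different rates.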
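A rough way to see the two growth rates side by side is to normalize both costs to a short baseline length. This sketch only compares scaling, with constant factors ignored. The baseline of 128 tokens and the function name `relative_costs` are assumptions made for illustration.

```python
def relative_costs(n: int, base: int = 128) -> tuple[float, float]:
    """Return (attention_compute, kv_cache_memory) relative to length `base`."""
    attention = (n / base) ** 2  # pairwise token interactions grow as n^2
    cache = n / base             # one K/V entry per token grows as n
    return attention, cache


for n in (128, 512, 2048, 8192, 32768, 131072):
    attention, cache = relative_costs(n)
    print(f"n={n:>6}: attention x{attention:>9.0f}, cache x{cache:>5.0f}")
```

Going from 128 to 128K tokens multiplies cache memory by 1024 but multiplies attention compute by about a million.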

[Interactive chart: attention compute (O(n²)) and KV cache memory (O(n)) plotted against sequence length, with a marker at the current position.]

PagedAttention (vLLM)

PagedAttention manages the KV cache the way an operating system manages virtual memory. Each request's logical blocks map to physical pages of GPU memory, which need not be contiguous and can be shared across requests.
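The mapping from logical blocks to physical pages can be illustrated with a toy block table. This is a conceptual sketch only and does not reflect vLLM's actual classes or API. `BlockAllocator`, `blocks_needed`, the block size of 16 tokens, and the pool of 8 blocks are all hypothetical.

```python
class BlockAllocator:
    """Toy pool of fixed-size physical blocks (hypothetical, not vLLM's API)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.ref_count = {}                  # physical block id -> number of users

    def allocate(self) -> int:
        block = self.free.pop(0)
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> int:
        # A second request points at the same physical block; nothing is copied.
        self.ref_count[block] += 1
        return block

    def release(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            del self.ref_count[block]
            self.free.append(block)


def blocks_needed(num_tokens: int, block_size: int) -> int:
    return -(-num_tokens // block_size)  # ceiling division


allocator = BlockAllocator(num_blocks=8, block_size=16)

# Request A holds 40 tokens, so it needs 3 logical blocks.
table_a = [allocator.allocate() for _ in range(blocks_needed(40, 16))]

# Request B shares A's first two full blocks (an identical 32-token prefix)
# and gets its own block for the tokens that differ.
table_b = [allocator.share(b) for b in table_a[:2]] + [allocator.allocate()]

print("A:", table_a)  # list index = logical block, value = physical block id
print("B:", table_b)
print("free physical blocks:", len(allocator.free))
```

Each request sees a contiguous list of logical blocks, while the physical blocks behind it can sit anywhere in the pool. Reference counting is one simple way to let two requests share a block safely until both have released it.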

[Diagram: Logical Blocks (per request) → Physical GPU Memory]