The Transformer Architecture
Attention, Heads, Position Embeddings — and Why "Attention Is All You Need" Eight Years Later
In June 2017, Vaswani et al. published a paper that replaced recurrence and convolution with a single primitive: scaled dot-product attention. The Transformer processes every token in a sequence in parallel, learns long-range dependencies through a softmax over QKᵀ, and stacks identical residual blocks until you run out of GPUs. Every frontier LLM (GPT-4, Claude, Gemini, Llama) is a descendant of this 2017 architecture, with a short list of refinements: deeper stacks, better position embeddings, and tweaks to normalization, the FFN, and attention.
Architecture Map
A single Transformer block. Input flows up. Residuals skip around the two sublayers.
1. Scaled Dot-Product Attention
The core primitive. Three projections of the input, a dot product, a softmax, and a weighted sum.
For input X ∈ ℝ^(n×d), learn three weight matrices W_Q, W_K ∈ ℝ^(d×d_k) and W_V ∈ ℝ^(d×d_v). Then:
import numpy as np
from scipy.special import softmax
Q = X @ W_Q  # queries (n × d_k)
K = X @ W_K  # keys (n × d_k)
V = X @ W_V  # values (n × d_v)
scores = Q @ K.T / np.sqrt(d_k)  # (n × n) similarity
weights = softmax(scores, axis=-1)  # rows sum to 1
output = weights @ V  # (n × d_v)
The √d_k scaling is the trick. Without it, the variance of the dot products grows with d_k (their typical magnitude with √d_k), pushing the softmax into a saturated region where gradients vanish. The original paper motivates the scaling with this same variance argument.
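For causal (autoregressive) attention, you mask the upper triangle of scores with −∞ before softmax so token i cannot attend to tokens j>i. That mask is the key difference between encoder and decoder self-attention; the decoder in the original encoder-decoder model additionally has a cross-attention sublayer over the encoder output.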
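The masking step can be sketched in plain Python for a single head. This is a minimal illustration, not an efficient implementation: the function names `softmax` and `causal_attention` and the toy inputs are assumptions made for the example, and real code would use batched tensor operations.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats (entries may be -inf)."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head causal attention over lists of row vectors."""
    n, d_k = len(Q), len(Q[0])
    outputs, all_weights = [], []
    for i in range(n):
        # Scaled dot product of query i with each key; future keys get -inf.
        scores = [
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
            if j <= i else float("-inf")
            for j in range(n)
        ]
        w = softmax(scores)
        all_weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w[j] * V[j][c] for j in range(n)) for c in range(len(V[0]))]
        )
    return outputs, all_weights

# Toy inputs (hypothetical values, chosen only for illustration).
Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = causal_attention(Q, K, V)
```

Because exp(−∞) is 0, every masked position receives exactly zero weight, so the first token can only attend to itself and its output equals its own value vector.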
2. Multi-Head Attention
A single attention operation gives each token only one weighting over the sequence. Multiple heads let the model attend to different things in parallel.
Split d_model into h heads, each of dimension d_k = d_model / h. Run attention independently in each head, concatenate, then project:
import numpy as np
from scipy.special import softmax

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    n, d = X.shape
    d_k = d // h  # per-head width; assumes d is divisible by h
    # Project, reshape to (n, h, d_k), then move the head axis to the front.
    Q = (X @ W_Q).reshape(n, h, d_k).transpose(1, 0, 2)  # (h, n, d_k)
    K = (X @ W_K).reshape(n, h, d_k).transpose(1, 0, 2)
    V = (X @ W_V).reshape(n, h, d_k).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n)
    weights = softmax(scores, axis=-1)
    out = weights @ V  # (h, n, d_k)
    out = out.transpose(1, 0, 2).reshape(n, d)  # concatenate heads
    return out @ W_O  # output projection
Probing studies (Clark et al. 2019, "What Does BERT Look At?") found that different heads specialize: some attend to the previous token, some to the next, some to syntactic dependencies, some to coreference. Many heads are redundant: Michel et al. (2019) reported that a large fraction of heads (around 40% in their BERT experiments) can be pruned with little loss.
3. Position Embeddings: Absolute, RoPE, ALiBi
Attention has no notion of order. Position info must be injected. Three families, each with tradeoffs.
| Scheme | How | Length generalization | Used by |
|---|---|---|---|
| Sinusoidal (original 2017) | Add fixed sin/cos of pos / 10000^(2i/d) to embeddings | Poor beyond training length | Original Transformer |
| Learned absolute | One vector per position, learned end-to-end | Hard cutoff at max position | BERT, GPT-2 |
| RoPE (Rotary) | Rotate Q,K pairs by an angle proportional to position before dot product | Good with NTK / YaRN scaling | Llama, Mistral, Qwen, GPT-NeoX |
| ALiBi | Subtract a linear penalty m·(i−j) from the score of query i on key j ≤ i | Strongest out of the box: the ALiBi paper reports extrapolation to several times the training length | BLOOM, MPT |
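The sinusoidal row of the table can be written out directly. The sketch below follows the formula from the 2017 paper (sine on even dimensions, cosine on odd ones); the function name `sinusoidal_encoding` is an assumption for this example, and `d_model` is assumed to be even.

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Return the d_model-dimensional sinusoidal vector for one position."""
    pe = []
    for i in range(d_model // 2):
        # Wavelengths grow geometrically with the dimension index i.
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])  # dims 2i and 2i+1
    return pe

# Position 0 is all sin(0)=0 / cos(0)=1 pairs.
first = sinusoidal_encoding(0, 8)
```

The vector is added to the token embedding at that position. Nothing is learned, which is why it is cheap, and also why it gives the model no help at positions it never saw during training.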
RoPE won. The reason: it encodes relative position (the dot product after rotation depends only on i−j) but plugs into the existing Q,K computation without changing shape. For each pair of dimensions (2k, 2k+1), rotate by θ_k · pos where θ_k = 10000^(−2k/d).
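The relative-position property can be checked numerically. The sketch below is a plain-Python illustration using the adjacent-pair layout (2k, 2k+1) described above; the helper names `rope` and `dot` and the sample vectors are assumptions for the example.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each adjacent pair (x[2k], x[2k+1]) by the angle pos * theta_k."""
    d = len(x)
    out = []
    for k in range(d // 2):
        theta = base ** (-2 * k / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        a, b = x[2 * k], x[2 * k + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical query and key vectors.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (i - j = 3) at two different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
# A different offset (i - j = 1) for comparison.
other = dot(rope(q, 5), rope(k, 4))
```

`near` and `far` agree up to floating-point error because only the offset enters the rotated dot product, while `other` differs. Rotations also preserve vector norms, so RoPE changes the direction of Q and K but not their length.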
# RoPE applied to Q, K (not V)
import torch

def rope(x, position):
    # x: (..., d) where d is even; position: scalar token index
    # This variant pairs dimension k with k + d/2 (the rotate-half layout)
    # instead of adjacent pairs (2k, 2k+1); the two differ only by a fixed
    # permutation of dimensions.
    half = x.shape[-1] // 2
    freqs = 10000 ** (-torch.arange(0, half) / half)  # theta_k = 10000^(-2k/d)
    angles = position * freqs  # (half,)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1*cos - x2*sin, x1*sin + x2*cos], dim=-1)
4. The Residual + LayerNorm + FFN Sandwich
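Two sublayers per block: attention and a feedforward network (FFN). Each is wrapped in a residual connection plus a norm. In the original paper the FFN hidden width is 4× d_model.
The original paper used post-norm (apply LayerNorm after the residual add). Modern models use pre-norm: apply the norm to the input of each sublayer, then add the residual unmodified. Pre-norm tends to train more stably as depth grows (the gap matters most in very deep stacks, on the order of 100 layers), while post-norm typically needs careful learning-rate warmup or it diverges.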
# Pre-norm Transformer block (Llama- and Mistral-style)
# attn, rms_norm, silu, and the W_* matrices are assumed to be defined elsewhere.
def block(x):
    x = x + attn(rms_norm(x))  # attention sublayer
    x = x + ffn(rms_norm(x))   # feedforward sublayer
    return x

def ffn(x):  # gated FFN (SwiGLU); Llama sets d_ff ≈ (8/3)·d rather than 4·d
    a = silu(x @ W_gate)  # gate branch
    b = x @ W_up          # linear branch
    return (a * b) @ W_down
The FFN is where most parameters live. With d=4096 and d_ff=11008 (Llama's SwiGLU choice), the three FFN matrices hold ~135M parameters per layer vs ~67M for the four attention projections.
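The per-layer parameter counts follow from the matrix shapes. A small sketch of the arithmetic, assuming full multi-head attention with four d×d projections and no bias terms (the function names are made up for this example):

```python
def attention_params(d_model):
    # W_Q, W_K, W_V, W_O: four d_model x d_model matrices, no biases.
    return 4 * d_model * d_model

def swiglu_ffn_params(d_model, d_ff):
    # W_gate and W_up are d_model x d_ff; W_down is d_ff x d_model.
    return 3 * d_model * d_ff

d_model, d_ff = 4096, 11008
attn_count = attention_params(d_model)        # 67,108,864
ffn_count = swiglu_ffn_params(d_model, d_ff)  # 135,266,304
```

With grouped-query attention the K and V projections shrink, so the attention share would be smaller still than this full multi-head estimate.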
Modern stacks also replace LayerNorm with RMSNorm: drop the mean-centering step and just divide by the root-mean-square of the activations. It is cheaper to compute (the RMSNorm paper reports speedups starting around 7%, depending on the model) with comparable accuracy.
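The difference between the two norms fits in a few lines. This is a minimal sketch on a single activation vector, with the learned gain (and LayerNorm's bias) left out for brevity; the function names and `eps` default are assumptions for the example.

```python
import math

def layer_norm(x, eps=1e-6):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    """Divide by the root-mean-square; no mean-centering."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

x = [1.0, 2.0, 3.0, 4.0]
ln = layer_norm(x)  # zero mean, unit variance
rn = rms_norm(x)    # unit RMS, mean is not removed
```

RMSNorm skips one reduction (the mean) and one subtraction per element, which is where its cost saving comes from.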
5. Autoregressive vs Encoder vs Encoder-Decoder
| Variant | Mask | Trains on | Examples |
|---|---|---|---|
| Decoder-only (autoregressive) | Causal (lower-triangular) | Next-token prediction | GPT, Llama, Claude, Gemini |
| Encoder-only | Bidirectional (full) | Masked-language modeling | BERT, RoBERTa, DeBERTa |
| Encoder-decoder | Encoder bidir, decoder causal + cross-attention | Seq2seq (translation, summarization) | T5, BART, original Transformer |
Decoder-only won the LLM race because next-token prediction is a self-supervised objective that can use effectively unlimited unlabeled text, and the same model can do generation, classification (via prompting), and embedding (via the final hidden state). Encoder-only models are often still stronger per parameter on classification benchmarks, but how much data the training objective can absorb has mattered more.
6. Scaling Laws
The Kaplan (2020) and Chinchilla (2022) papers fit empirical curves to compute, parameters, and data. The Chinchilla finding: for a fixed compute budget, you should scale parameters and tokens roughly equally — about 20 tokens per parameter.
# Chinchilla rule of thumb
# Compute-optimal training = ~20 tokens per parameter
# 7B model → 140B tokens
# 70B model → 1.4T tokens
# 400B model → 8T tokens
# Parametric loss fit from the Chinchilla paper (approximate)
L(N, D) ≈ A/N^α + B/D^β + L_irreducible
# N = parameters, D = training tokens
# α ≈ 0.34, β ≈ 0.28 for next-token cross-entropy
Llama-3 deliberately departed from the Chinchilla ratio: the 8B model was trained on 15T tokens (1875 tokens/param). The reasoning: for a widely deployed model, lifetime inference cost outweighs the one-time training cost, so it pays to over-train a smaller model.
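The rule of thumb and the parametric loss form can be turned into a small calculator. This is a rough sketch: the function names are invented for the example, and the constants A, B, and E are the approximate values reported for the Chinchilla fit, so treat the absolute loss numbers as illustrative rather than predictive.

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Compute-optimal token count under the ~20 tokens/param rule of thumb."""
    return n_params * tokens_per_param

def parametric_loss(n_params, n_tokens,
                    A=406.4, B=410.7, alpha=0.34, beta=0.28, E=1.69):
    """L(N, D) = A / N^alpha + B / D^beta + E, with E the irreducible loss."""
    return A / n_params ** alpha + B / n_tokens ** beta + E

tokens_7b = chinchilla_tokens(7_000_000_000)          # 140B tokens
llama3_ratio = 15_000_000_000_000 // 8_000_000_000    # 1875 tokens per parameter

# Same 8B model, Chinchilla-optimal data vs. heavily over-trained.
loss_optimal = parametric_loss(8e9, chinchilla_tokens(8e9))
loss_overtrained = parametric_loss(8e9, 15e12)
```

Under this fit, adding data keeps lowering the B/D^β term with diminishing returns, while the A/N^α term sets a floor that only a larger model can lower. Over-training accepts that floor in exchange for a cheaper model to serve.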
Tradeoffs
| Pro | Con |
|---|---|
| Fully parallel training (no sequential RNN unroll) | O(n²) attention memory and compute |
| Long-range dependencies via direct token-to-token attention | KV cache grows linearly with context — dominates inference memory |
| Single architecture for text, code, vision (ViT), audio | Position generalization beyond training length is fragile |
| Scales predictably (Chinchilla, Kaplan laws) | FFN is a parameter sink; MoE only partially helps |
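The two memory costs in the table can be estimated with back-of-the-envelope arithmetic. The configuration below (32 layers, 32 heads, head dimension 128, 2-byte values) is a hypothetical 7B-class setup chosen for illustration, and the function names are assumptions.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # Factor 2: one cached tensor for keys and one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

def attention_matrix_bytes(seq_len, n_heads, bytes_per_value=2):
    # One (seq_len x seq_len) score matrix per head, for a single layer,
    # if the matrix is fully materialized.
    return n_heads * seq_len * seq_len * bytes_per_value

GIB = 1024 ** 3
cache_4k = kv_cache_bytes(4096, n_layers=32, n_kv_heads=32, head_dim=128)
cache_8k = kv_cache_bytes(8192, n_layers=32, n_kv_heads=32, head_dim=128)
scores_4k = attention_matrix_bytes(4096, n_heads=32)
scores_8k = attention_matrix_bytes(8192, n_heads=32)
```

Doubling the context doubles the KV cache (2 GiB to 4 GiB per sequence here) but quadruples the materialized score matrix (1 GiB to 4 GiB per layer). Reducing `n_kv_heads`, as grouped-query attention does, shrinks the cache proportionally.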
FAQ
Why √d_k scaling specifically?
If the entries of a query vector and a key vector are independent with zero mean and unit variance, their dot product has variance d_k. Dividing by √d_k brings it back to unit variance, which keeps the softmax in a regime where it has gradient (not saturated near 0 or 1).
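The variance claim is easy to check empirically. The sketch below draws random standard-normal query and key vectors with a fixed seed; the function name, trial count, and seed are arbitrary choices for the example.

```python
import random

def dot_variance(d_k, trials=4000, seed=0):
    """Sample variance of q·k for standard-normal q, k of width d_k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0, 1) for _ in range(d_k)]
        k = [rng.gauss(0, 1) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    return sum((v - mean) ** 2 for v in dots) / trials

raw = dot_variance(64)  # close to d_k = 64
scaled = raw / 64       # variance after dividing each dot product by sqrt(64)
```

Dividing a random variable by √d_k divides its variance by d_k, so `scaled` comes out close to 1 regardless of the head width.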
Is attention really O(n²) in practice?
Yes for compute. FlashAttention reduces memory to O(n) by never materializing the full n×n attention matrix: it computes attention tile by tile in on-chip SRAM and recomputes tiles in the backward pass instead of storing them. The compute is still quadratic, which is why long-context models (1M tokens) are expensive even with FlashAttention.
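The tile-by-tile idea rests on an online softmax: keep a running maximum, a running normalizer, and a running weighted sum, and rescale them whenever a new tile raises the maximum. The sketch below shows that bookkeeping for one query row in plain Python. It illustrates the numerical trick only, not the actual GPU kernel, and the function names and sample data are assumptions.

```python
import math

def attention_row_naive(q, K, V):
    """Reference: materialize all scores for one query, then softmax."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(w[j] * V[j][c] for j in range(len(V))) / z
            for c in range(len(V[0]))]

def attention_row_tiled(q, K, V, tile=2):
    """Same result, but only one tile of scores exists at any time."""
    running_max = float("-inf")
    normalizer = 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), tile):
        k_tile, v_tile = K[start:start + tile], V[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
                  for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        normalizer *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            normalizer += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / normalizer for a in acc]

# Hypothetical data: one query against five keys/values.
q = [0.5, -1.0, 2.0]
K = [[1.0, 0.0, 0.5], [0.2, 0.3, -0.1], [2.0, -1.0, 1.5],
     [-0.5, 0.4, 0.0], [0.9, 0.9, 0.9]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
naive = attention_row_naive(q, K, V)
tiled = attention_row_tiled(q, K, V, tile=2)
```

Both functions do the same number of multiply-adds, which is the point made above: tiling changes the memory footprint, not the quadratic compute.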
Why did MoE not replace dense Transformers?
Mixture-of-Experts (Switch Transformer, Mixtral) routes each token to a subset of FFN experts. It lowers the compute per token for a given parameter count, but all experts must still be held in memory, and it introduces routing instability, load imbalance, and harder distributed training. MoE has been adopted in some large models, yet dense models remain common because they're simpler to train and serve and their inference latency is more predictable.
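Token routing can be sketched in a few lines. The example below uses top-2 routing with a softmax over the selected experts' logits, which is one common scheme; the function names, the toy experts, and the router logits are all hypothetical (in a real model the logits come from a learned router applied to the token's hidden state).

```python
import math

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts; softmax over just their logits."""
    top = sorted(range(len(router_logits)),
                 key=lambda e: router_logits[e], reverse=True)[:k]
    m = max(router_logits[e] for e in top)
    exps = {e: math.exp(router_logits[e] - m) for e in top}
    total = sum(exps.values())
    return {e: v / total for e, v in exps.items()}

def moe_ffn(x, router_logits, experts, k=2):
    """Gate-weighted sum of the selected experts' outputs for one token."""
    gates = route_top_k(router_logits, k)
    out = [0.0] * len(x)
    for e, g in gates.items():
        y = experts[e](x)  # only the selected experts are evaluated
        out = [o + g * v for o, v in zip(out, y)]
    return out

# Toy experts that just scale the input by 1, 2, 3, 4.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
gates = route_top_k([0.1, 2.0, -1.0, 2.0], k=2)
y = moe_ffn([1.0, 2.0], [0.1, 2.0, -1.0, 2.0], experts, k=2)
```

Only k of the experts run per token, which is the compute saving. Load imbalance arises when the router keeps choosing the same few experts, which is why MoE training typically adds a balancing term.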
What's the difference between RoPE and ALiBi for long context?
RoPE encodes position via rotation angle, so out-of-distribution positions produce out-of-distribution rotations. Techniques like NTK-aware scaling and YaRN re-interpolate the frequencies to extend context. ALiBi just adds a linear penalty to attention scores, which extrapolates naturally because the penalty grows monotonically.
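The ALiBi penalty is just a fixed bias matrix added to the scores. In the sketch below, the slope schedule is the geometric sequence the ALiBi paper describes for power-of-two head counts; the function names are assumptions, and folding the causal mask into the same matrix is a convenience of this example.

```python
def alibi_slopes(n_heads):
    """Per-head slopes: for 8 heads, 1/2, 1/4, ..., 1/256."""
    start = 2 ** (-8 / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    """Bias added to scores: -slope * (i - j) for j <= i, -inf for future keys."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])  # head 0 has slope 0.5
```

The bias depends only on the distance i − j, so a longer sequence simply extends the same linear ramp. No position ever produces a value the model has no rule for, in contrast to an unseen RoPE rotation angle.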
Why are decoder-only models the dominant LLM architecture?
Three reasons: (1) next-token prediction is a self-supervised objective that needs no labels and draws a training signal from every token of internet-scale text, (2) one model serves both generation and embedding tasks, (3) the engineering ecosystem (KV cache, speculative decoding, vLLM) is all built around causal attention.
How do today's Transformers differ from the original 2017 paper?
The skeleton is unchanged. Modern variants swap: pre-norm instead of post-norm, RMSNorm instead of LayerNorm, RoPE instead of sinusoidal, SwiGLU instead of ReLU FFN, GQA instead of full multi-head, and learned biases removed. None of these is fundamental — the 2017 architecture was already 95% right.