The Transformer Architecture

Attention, Heads, Position Embeddings — and Why "Attention Is All You Need" Eight Years Later

In June 2017, Vaswani et al. published a paper that replaced recurrence and convolution with a single primitive: scaled dot-product attention. The Transformer processes every token in a sequence in parallel, learns long-range dependencies through a softmax over Q·Kᵀ, and stacks identical residual blocks as deep as the compute budget allows. Every frontier LLM (GPT-4, Claude, Gemini, Llama) is a descendant of this 2017 architecture with a handful of changes: more layers, better position embeddings, and tweaks to normalization, the FFN, and attention.

Architecture Map

A single Transformer block. Input flows up. Residuals skip around the two sublayers.

Key Numbers

2017

Original "Attention Is All You Need" paper (Vaswani et al.)

512

d_model in the original paper · GPT-3 used 12,288

8

Heads per layer in the original · Llama-3 70B uses 64

4×

FFN expansion ratio (d_ff = 4·d_model is the canonical choice)

O(n²)

Attention is quadratic in sequence length n — the cost driver

96

Layers in GPT-3 175B · Llama-3 405B has 126

~12·N·D²

Non-embedding parameters, where N=layers, D=d_model (Kaplan et al. 2020); training costs roughly 6 FLOPs per parameter per token

1. Scaled Dot-Product Attention

The core primitive. Three projections of the input, a dot product, a softmax, and a weighted sum.

For input X ∈ ℝ^(n×d), learn three weight matrices W_Q, W_K ∈ ℝ^(d×d_k) and W_V ∈ ℝ^(d×d_v). Then:

Q = X @ W_Q       # queries  (n × d_k)
K = X @ W_K       # keys     (n × d_k)
V = X @ W_V       # values   (n × d_v)

scores  = Q @ K.T / sqrt(d_k)         # (n × n) similarity
weights = softmax(scores, axis=-1)    # rows sum to 1
output  = weights @ V                 # (n × d_v)

The √d_k scaling is the trick. Without it, the variance of the dot products grows with d_k, pushing the softmax into a saturated region where gradients vanish. Vaswani et al. give exactly this variance argument in a footnote of the original paper: dividing by √d_k keeps the scores at roughly unit variance.

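For causal (autoregressive) attention, you mask the upper triangle of scores with −∞ before softmax so token i cannot attend to tokens j>i. That mask is the only difference between encoder and decoder self-attention; in an encoder-decoder model the decoder additionally has a cross-attention sublayer that reads the encoder output.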
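The steps above can be put together in a few lines of NumPy. This is a minimal single-head sketch with random weights, not a reference implementation; the function name `attention`, the `causal` flag, and the toy sizes are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(X, W_Q, W_K, W_V, causal=False):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n) similarity
    if causal:
        n = scores.shape[0]
        # True strictly above the diagonal: positions j > i that token i must not see.
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)               # rows sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d, d_k, d_v = 5, 16, 8, 8                          # toy sizes
X = rng.normal(size=(n, d))
W_Q, W_K = rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d_v))

out, weights = attention(X, W_Q, W_K, W_V, causal=True)
print(out.shape)                 # (5, 8)
print(np.round(weights, 2))      # lower-triangular: no weight on future tokens
```

The diagonal is never masked, so every row keeps at least one finite score and the softmax stays well defined.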

2. Multi-Head Attention

A single attention operation gives each token only one set of weights over the sequence. Multiple heads let the model attend to different things in parallel.

Split d_model into h heads, each of dimension d_k = d_model / h. Run attention independently in each head, concatenate, then project:

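import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    # X: (n, d); W_Q, W_K, W_V, W_O: (d, d); h must divide d
    n, d = X.shape
    d_k = d // h
    # project, then split the feature axis into h heads of size d_k
    Q = (X @ W_Q).reshape(n, h, d_k).transpose(1, 0, 2)  # (h, n, d_k)
    K = (X @ W_K).reshape(n, h, d_k).transpose(1, 0, 2)
    V = (X @ W_V).reshape(n, h, d_k).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)     # (h, n, n)
    weights = softmax(scores, axis=-1)
    out = weights @ V                       # (h, n, d_k)
    out = out.transpose(1, 0, 2).reshape(n, d)           # concatenate heads
    return out @ W_O                        # output projection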

Probing studies (Clark et al. 2019, "What Does BERT Look At?") found that different heads specialize: some attend to the previous token, some to the next, some to syntactic dependencies, some to coreference. Many heads are redundant: Michel et al. (2019) showed you can prune roughly 20-40% of heads, depending on the model, with minimal loss.

3. Position Embeddings: Absolute, RoPE, ALiBi

Attention has no notion of order. Position info must be injected. Three families, each with tradeoffs.

Scheme	How	Length generalization	Used by
Sinusoidal (original 2017)	Add fixed sin/cos at frequencies 10000^(−2i/d) to embeddings	Poor beyond training length	Original Transformer
Learned absolute	One vector per position, learned end-to-end	Hard cutoff at max position	BERT, GPT-2
RoPE (Rotary)	Rotate Q,K pairs by an angle proportional to position before dot product	Good with NTK / YaRN scaling	Llama, Mistral, Qwen, GPT-NeoX
ALiBi	Subtract a linear penalty m·|i−j| from attention scores	Best: extrapolates 5-10× train length	BLOOM, MPT
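To make the first row of the table concrete, here is a small sketch of the fixed sinusoidal encoding, following the sin/cos definition in the original paper. The function name `sinusoidal_positions` and the sizes are illustrative assumptions, and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model, base=10000.0):
    # PE[pos, 2i]   = sin(pos / base^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / base^(2i/d_model))
    pos = np.arange(n_positions)[:, None]             # (n, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d/2)
    angles = pos / base ** (2 * i / d_model)          # (n, d/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

pe = sinusoidal_positions(n_positions=128, d_model=64)
print(pe.shape)   # (128, 64); added element-wise to the token embeddings
```

Low dimensions oscillate quickly and high dimensions slowly, so each position gets a distinct pattern across the feature axis.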

RoPE won on adoption. The reason: it encodes relative position (the dot product after rotation depends only on i−j) but plugs into the existing Q,K computation without changing shape. For each pair of dimensions (2k, 2k+1), rotate by θ_k · pos where θ_k = 10000^(−2k/d).

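# RoPE applied to Q, K (not V)
import torch

def rope(x, position):
    # x: (..., d) where d is even; position: scalar token index
    # pairs dimension k with k + d/2 ("rotate-half" layout), which matches the
    # interleaved (2k, 2k+1) pairing up to a fixed permutation of dimensions
    half = x.shape[-1] // 2
    # θ_k = 10000^(−2k/d) for k = 0 .. d/2 − 1
    freqs = 10000.0 ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = position * freqs                 # (half,)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1*cos - x2*sin, x1*sin + x2*cos], dim=-1)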
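The relative-position claim is easy to check numerically. The sketch below is a NumPy port of the rotation (the name `rope_np` is an assumption, used to avoid a torch dependency) and compares attention scores for the same offset at different absolute positions.

```python
import numpy as np

def rope_np(x, position, base=10000.0):
    # Same rotation as above: dimension k is paired with dimension k + d/2.
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

score_a = rope_np(q, 5) @ rope_np(k, 3)         # offset 2
score_b = rope_np(q, 105) @ rope_np(k, 103)     # offset 2, shifted by 100
score_c = rope_np(q, 5) @ rope_np(k, 4)         # offset 1

print(np.isclose(score_a, score_b))   # same offset gives the same score
print(np.isclose(score_a, score_c))   # a different offset generally does not
```

Because each pair is only rotated, the vector norm is unchanged; position enters the score purely through the angle difference.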

4. The Residual + LayerNorm + FFN Sandwich

Two sublayers per block: attention and a position-wise FFN. Each is wrapped in a residual connection plus a normalization.

The original paper used post-norm (apply LayerNorm after the residual addition). Modern models use pre-norm: apply the norm to the input of each sublayer, then add the residual unmodified. Pre-norm trains more stably as depth grows; post-norm typically needs careful learning-rate warmup or it diverges.

# Pre-norm Transformer block (Llama, GPT-NeoX, Mistral)
def block(x):
    x = x + attn(rms_norm(x))      # attention sublayer
    x = x + ffn(rms_norm(x))       # feedforward sublayer
    return x

def ffn(x):                         # gated FFN (SwiGLU); hidden width ≈ 8/3 × d_model in Llama
    a = silu(x @ W_gate)
    b = x @ W_up
    return (a * b) @ W_down
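For contrast, the two residual wirings can be written side by side. This is a hedged NumPy sketch: the sublayers are passed in as callables, `W_mix` is a hypothetical linear stand-in for attention, and the norm is a gain-free RMS normalization rather than the LayerNorm of the original paper.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))                 # x * sigmoid(x)

def swiglu_ffn(x, W_gate, W_up, W_down):
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

def pre_norm_block(x, attn, ffn, norm):
    # Norm sits on the sublayer input; the residual path is left untouched.
    x = x + attn(norm(x))
    x = x + ffn(norm(x))
    return x

def post_norm_block(x, attn, ffn, norm):
    # Norm is applied after the residual addition, so it rescales the residual path too.
    x = norm(x + attn(x))
    x = norm(x + ffn(x))
    return x

rng = np.random.default_rng(0)
n, d, d_ff = 4, 8, 16                              # toy sizes
x = rng.normal(size=(n, d))
W_gate, W_up = rng.normal(size=(d, d_ff)), rng.normal(size=(d, d_ff))
W_down = rng.normal(size=(d_ff, d))
W_mix = rng.normal(size=(d, d)) * 0.1              # hypothetical stand-in for attention

norm = lambda h: h / np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + 1e-6)
attn = lambda h: h @ W_mix
ffn = lambda h: swiglu_ffn(h, W_gate, W_up, W_down)

y_pre = pre_norm_block(x, attn, ffn, norm)
y_post = post_norm_block(x, attn, ffn, norm)
print(y_pre.shape, y_post.shape)                   # (4, 8) (4, 8)
```

In the pre-norm form, the input passes through to the output unchanged whenever the sublayers contribute nothing. That identity path is one common explanation for its stability in deep stacks.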

The FFN is where most parameters live. With d=4096 and d_ff=11008 (Llama's SwiGLU choice), the FFN has ~135M parameters per layer vs ~67M for attention.
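The per-layer counts follow from the matrix shapes. A quick arithmetic check, assuming full multi-head attention with square projections and no bias terms (the helper names are illustrative):

```python
def attention_params(d_model):
    # W_Q, W_K, W_V, W_O: four d_model x d_model matrices.
    return 4 * d_model * d_model

def swiglu_ffn_params(d_model, d_ff):
    # W_gate and W_up are d_model x d_ff; W_down is d_ff x d_model.
    return 3 * d_model * d_ff

attn_count = attention_params(4096)
ffn_count = swiglu_ffn_params(4096, 11008)

print(f"attention: {attn_count / 1e6:.1f}M")   # attention: 67.1M
print(f"ffn:       {ffn_count / 1e6:.1f}M")    # ffn:       135.3M
```

With grouped-query attention the K and V projections shrink, which tilts the balance even further toward the FFN.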

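Modern stacks also replace LayerNorm with RMSNorm: drop the mean-centering step and just divide by the root-mean-square of activations. The RMSNorm paper (Zhang and Sennrich, 2019) reports running-time reductions of roughly 7% to 64% depending on the model, with comparable accuracy.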
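The relationship between the two norms fits in a few lines. This sketch leaves out the learned gain and bias that real implementations carry, and the `eps` value is an assumption.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Center, then divide by the standard deviation (no learned gain or bias here).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-6):
    # No centering: divide by the root-mean-square of the activations.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, size=(4, 64))              # deliberately non-zero mean
centered = x - x.mean(axis=-1, keepdims=True)

print(np.allclose(layer_norm(x), rms_norm(centered)))   # True: LayerNorm = RMSNorm after centering
print(np.allclose(layer_norm(x), rms_norm(x)))          # False: they differ when the mean is non-zero
```

RMSNorm saves the mean computation and subtraction per token, which is the source of the speedup.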

5. Autoregressive vs Encoder vs Encoder-Decoder

Variant	Mask	Trains on	Examples
Decoder-only (autoregressive)	Causal (lower-triangular)	Next-token prediction	GPT, Llama, Claude, Gemini
Encoder-only	Bidirectional (full)	Masked-language modeling	BERT, RoBERTa, DeBERTa
Encoder-decoder	Encoder bidir, decoder causal + cross-attention	Seq2seq (translation, summarization)	T5, BART, original Transformer

Decoder-only won the LLM race because next-token prediction is a self-supervised objective that turns any unlabeled text into training data, and the same model can do generation, classification (via prompting), and embedding (via final hidden state). Encoder-only models still tend to win on classification benchmarks per parameter, but the training data ceiling matters more.

6. Scaling Laws

The Kaplan (2020) and Chinchilla (2022) papers fit empirical curves to compute, parameters, and data. The Chinchilla finding: for a fixed compute budget, you should scale parameters and tokens roughly equally — about 20 tokens per parameter.

# Chinchilla rule of thumb
# Optimal training = ~20 tokens per parameter
# 7B model  → 140B tokens
# 70B model → 1.4T tokens
# 400B model → 8T tokens

# Chinchilla parametric loss fit (Hoffmann et al. 2022); here N = parameters, D = training tokens
L(N, D) ≈ A/N^α + B/D^β + L_irreducible
# α ≈ 0.34, β ≈ 0.28 for next-token cross-entropy

Llama-3 deliberately departed from the Chinchilla ratio: its 8B model was trained on 15T tokens (1875 tokens/param). The reason: for a deployed model, lifetime inference cost outweighs one-time training cost, so it pays to over-train a smaller architecture.
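The arithmetic behind these figures is short enough to script. The sketch below assumes the 20 tokens/parameter rule of thumb and the common approximation of about 6 training FLOPs per parameter per token; the function names are illustrative.

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    # Compute-optimal token count under the ~20 tokens/parameter rule of thumb.
    return n_params * tokens_per_param

def training_flops(n_params, n_tokens):
    # Widely used approximation: about 6 FLOPs per parameter per training token.
    return 6 * n_params * n_tokens

B, T = 10**9, 10**12
for n_params in (7 * B, 70 * B, 400 * B):
    tokens = chinchilla_tokens(n_params)
    print(f"{n_params / B:.0f}B params -> {tokens / T:.2f}T tokens, "
          f"~{training_flops(n_params, tokens):.2e} training FLOPs")

llama3_ratio = (15 * T) / (8 * B)        # tokens per parameter for the 8B model
overtrain_factor = llama3_ratio / 20     # multiple of the Chinchilla ratio
print(llama3_ratio, overtrain_factor)    # 1875.0 93.75
```

By this measure the 8B model saw roughly 94 times more tokens than the compute-optimal recipe would assign it.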

Tradeoffs

Pro	Con
Fully parallel training (no sequential RNN unroll)	O(n²) attention memory and compute
Long-range dependencies via direct token-to-token attention	KV cache grows linearly with context — dominates inference memory
Single architecture for text, code, vision (ViT), audio	Position generalization beyond training length is fragile
Scales predictably (Chinchilla, Kaplan laws)	FFN is a parameter sink; MoE only partially helps
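The KV cache row is worth quantifying. The sketch below assumes an illustrative configuration (32 layers, 32 KV heads of dimension 128, 16-bit values); the function name and these numbers are assumptions for the example, not figures from the table.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Two cached tensors (K and V) per layer, each seq_len x n_kv_heads x head_dim.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
at_4k = kv_cache_bytes(32, 32, 128, seq_len=4096)
at_128k = kv_cache_bytes(32, 32, 128, seq_len=131072)

print(per_token / 2**20, "MiB per token")     # 0.5 MiB per token
print(at_4k / 2**30, "GiB at 4k context")     # 2.0 GiB at 4k context
print(at_128k / 2**30, "GiB at 128k context") # 64.0 GiB at 128k context
```

The cost is per sequence, so batching multiplies it again. Reducing `n_kv_heads`, as grouped-query attention does, shrinks the cache by the same factor.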

FAQ

Why √d_k scaling specifically?

If the entries of Q and K are independent with zero mean and unit variance, the dot product Q·K is a sum of d_k unit-variance terms and so has variance d_k. Dividing by √d_k brings it back to unit variance, which keeps the softmax in a regime where it has gradient (not saturated near 0 or 1).
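A quick simulation backs up the variance argument. It assumes independent standard-normal entries, which is an idealization of real activations, and the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.normal(size=(20_000, d_k))
k = rng.normal(size=(20_000, d_k))

dots = np.sum(q * k, axis=-1)                  # 20,000 independent dot products
var_raw = dots.var()
var_scaled = (dots / np.sqrt(d_k)).var()

print(round(var_raw, 1))       # close to d_k = 64
print(round(var_scaled, 2))    # close to 1
```

With d_k = 64 the unscaled scores have a standard deviation of about 8, enough for one entry to dominate the softmax and flatten the gradient for the rest.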

Is attention really O(n²) in practice?

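Yes for compute. FlashAttention reduces memory to O(n) by computing attention tile-by-tile in on-chip SRAM, so the full n×n matrix is never written to GPU main memory (tiles are recomputed in the backward pass instead of being stored). The compute is still quadratic, which is why long-context models (1M tokens) are expensive even with FlashAttention.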
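The tiling idea rests on an online softmax: keep a running max and running denominator per query, and rescale the accumulated output as each new block of keys arrives. The NumPy sketch below illustrates only that numerical trick. It tiles over keys alone, runs on the CPU, and is not the FlashAttention kernel; the function names and block size are assumptions.

```python
import numpy as np

def full_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # materializes the (n, n) matrix
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=16):
    n, d_k = Q.shape
    out = np.zeros((n, V.shape[-1]))
    row_max = np.full((n, 1), -np.inf)             # running max score per query
    row_sum = np.zeros((n, 1))                     # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Q @ Kb.T / np.sqrt(d_k)                # only an (n, block) tile is live
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        scale = np.exp(row_max - new_max)          # rescales the old accumulators
        p = np.exp(s - new_max)
        row_sum = row_sum * scale + p.sum(axis=-1, keepdims=True)
        out = out * scale + p @ Vb
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(50, 32)) for _ in range(3))
print(np.allclose(full_attention(Q, K, V), tiled_attention(Q, K, V)))   # True
```

The result is exact, not an approximation: the same number of multiply-adds is performed, only the peak memory changes.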

Why did MoE not replace dense Transformers?

Mixture-of-Experts (Switch Transformer, Mixtral) routes each token to a subset of FFN experts. It adds parameters without a proportional increase in per-token compute, but it introduces routing instability, load imbalance, and harder distributed training. Dense models remain a common choice because they're simpler and their inference latency is more predictable.
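A toy top-k router shows both the mechanism and where load imbalance comes from. This is a sketch under assumptions: the gate is a softmax over the selected experts only, the experts are plain linear maps, and every name here (`top_k_route`, `moe_ffn`, `W_router`) is hypothetical, not taken from any particular model.

```python
import numpy as np

def top_k_route(x, W_router, k=2):
    logits = x @ W_router                                   # (n_tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -k:]               # indices of the k largest logits
    top_logits = np.take_along_axis(logits, top, axis=-1)
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)
    gates = np.exp(top_logits)
    gates = gates / gates.sum(axis=-1, keepdims=True)       # softmax over chosen experts only
    return top, gates

def moe_ffn(x, W_router, experts, k=2):
    top, gates = top_k_route(x, W_router, k)
    out = np.zeros_like(x)
    for e, expert in enumerate(experts):
        token_idx, slot = np.nonzero(top == e)               # tokens routed to expert e
        if token_idx.size:
            out[token_idx] += gates[token_idx, slot][:, None] * expert(x[token_idx])
    return out, top

rng = np.random.default_rng(0)
n_tokens, d, n_experts = 32, 8, 4
x = rng.normal(size=(n_tokens, d))
W_router = rng.normal(size=(d, n_experts))
expert_weights = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda h, W=W: h @ W for W in expert_weights]     # linear stand-ins for FFNs

out, top = moe_ffn(x, W_router, experts, k=2)
load = np.bincount(top.ravel(), minlength=n_experts)
print(out.shape)   # (32, 8)
print(load)        # tokens per expert; typically uneven without a balancing loss
```

Each token touches only k of the expert weight matrices, which is why parameter count and per-token compute decouple.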

What's the difference between RoPE and ALiBi for long context?

RoPE encodes position via rotation angle, so out-of-distribution positions produce out-of-distribution rotations. Techniques like NTK-aware scaling and YaRN re-interpolate the frequencies to extend context. ALiBi just adds a linear penalty to attention scores, which extrapolates naturally because the penalty grows monotonically.
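The ALiBi penalty is simple enough to write out directly. The sketch below builds the bias matrix for one head and uses the geometric slope schedule described in the ALiBi paper for power-of-two head counts; the function names are illustrative.

```python
import numpy as np

def alibi_bias(n, slope):
    # Bias added to the (n, n) attention scores before softmax: -slope * |i - j|.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return -slope * np.abs(i - j)

def alibi_slopes(n_heads):
    # Geometric sequence 2^(-8/h), 2^(-16/h), ..., one slope per head.
    return np.array([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])

slopes = alibi_slopes(8)
bias = alibi_bias(6, slopes[0])

print(slopes[0], slopes[-1])   # 0.5 0.00390625
print(bias[5])                 # penalty grows linearly with distance from token 5
```

Nothing in the bias depends on the training length, so a longer sequence only extends the same linear ramp. Heads with small slopes keep a long effective window while steep heads stay local.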

Why are decoder-only models the dominant LLM architecture?

Three reasons: (1) next-token prediction is a self-supervised objective that scales to internet-scale text without labeling and gets a training signal from every token, (2) one model serves both generation and embedding tasks, (3) the engineering ecosystem (KV cache, speculative decoding, vLLM) is all built around causal attention.

How does the Transformer differ from the original 2017 paper today?

The skeleton is unchanged. Modern variants swap in pre-norm for post-norm, RMSNorm for LayerNorm, RoPE for sinusoidal embeddings, SwiGLU for the ReLU FFN, and grouped-query attention (GQA) for full multi-head attention, and they drop the bias terms from linear layers. None of these is fundamental: the 2017 architecture was already 95% right.