Tokenization

BPE, WordPiece, SentencePiece, tiktoken — and Why "Tokenizers Are the New Achilles' Heel"

Before any forward pass, a string of text becomes a sequence of integers. The function that performs this mapping — the tokenizer — is trained once, frozen before pretraining, and lives forever in the model. It's a deterministic vocabulary lookup, but it controls cost (tokens billed), context length (in tokens not chars), language fairness (Burmese costs 8× English), and the surprisingly long list of bugs labeled "tokenization issues." The Big Three families are BPE, WordPiece, and Unigram (SentencePiece).

The Tokenization Pipeline

Text in, integers out. Three stages: normalize, pre-tokenize, encode.

Raw text "Hello, world!" Normalize NFC, lowercase? Pre-tokenize split on whitespace Subword merges (BPE / WordPiece / Unigram) "Hello"+"," → [9906, 11] [9906, 11, 1917, 0]

Key Numbers

50,257
GPT-2 vocabulary size · GPT-4o uses 200,019
~4 chars
Average token length in English (cl100k_base)
Cost ratio: same text in Burmese vs English (GPT-3.5)
256
Base alphabet size of byte-level BPE — every byte is a token
2015
Sennrich BPE paper — applied to MT, then everything
128,000
Llama-3 vocabulary size (4× larger than Llama-2)
~750
Tokens per page of English text (industry rule of thumb)

1. Byte-Pair Encoding (BPE)

Greedily merge the most frequent adjacent pair. Repeat until vocab size hit.

BPE was invented for compression in 1994 (Gage), then re-invented for NMT by Sennrich, Haddow & Birch in 2015. The algorithm:

# Train BPE from a corpus
def train_bpe(corpus, vocab_size):
    # 1. Start with character-level vocab
    vocab = set(''.join(corpus))
    # 2. Count pairs in corpus
    while len(vocab) < vocab_size:
        pairs = count_adjacent_pairs(corpus)
        best = max(pairs, key=pairs.get)        # most common pair
        corpus = merge_pair(corpus, best)       # replace pair → token
        vocab.add(best[0] + best[1])
    return vocab

# Example trace on "low, lower, newest, widest"
# l,o,w  l,o,w,e,r  n,e,w,e,s,t  w,i,d,e,s,t
# merge "e,s" → "es"   (most common pair)
# l,o,w  l,o,w,e,r  n,e,w,es,t  w,i,d,es,t
# merge "es,t" → "est"
# l,o,w  l,o,w,e,r  n,e,w,est   w,i,d,est

Key properties: deterministic, no <UNK> needed if you start from bytes, vocabulary is a function of training corpus frequency. The downside: the merge sequence is greedy and locally optimal — there's no global re-segmentation.

2. Byte-Level BPE (GPT-2 onwards)

BPE on UTF-8 bytes, not Unicode characters. The vocab base is 256, no OOV ever.

Standard BPE has a problem: rare Unicode characters explode the base vocab, or get mapped to <UNK>. GPT-2 (Radford 2019) fixed this by working on raw UTF-8 bytes:

# Every byte 0–255 is a valid token. No OOV possible.
# But we need to remap "weird" bytes (whitespace, control chars)
# to printable Unicode for the merge table to be readable.

def bytes_to_unicode():
    bs = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

# Now BPE merges happen over this remapped space.
# At decode time, every byte is recoverable.

This is what tiktoken implements (in Rust, fast). It's also what HuggingFace's GPT2Tokenizer does. The 256-byte base means any byte sequence is encodable, including invalid UTF-8 — which matters for code, binary, and adversarial inputs.

3. WordPiece (BERT)

Like BPE but merges based on likelihood, not frequency.

WordPiece (Schuster & Nakajima 2012, used by BERT) picks the merge that maximizes the language-model likelihood of the corpus:

# BPE merge criterion:        argmax  count(a, b)
# WordPiece merge criterion:  argmax  count(a, b) / (count(a) * count(b))
#                            ↑ pointwise mutual information

# Practical effect: WordPiece prefers merges where a and b
# co-occur more than chance, even if absolute count is lower.

WordPiece also marks subwords with ## prefix: "playing" → ["play", "##ing"]. This makes detokenization unambiguous but adds a special token to the vocab. BERT used a 30k-token WordPiece vocab. WordPiece feels old now — every modern LLM uses BPE or Unigram instead.

4. SentencePiece + Unigram (Llama, T5)

Treat the whole input as a sequence of bytes (no whitespace split). Unigram model picks the most likely segmentation.

SentencePiece (Kudo 2018) is a library; Unigram is its preferred algorithm. It works top-down:

# 1. Start with a huge candidate vocab (e.g., 1M substrings)
# 2. Assign each token a probability via EM
# 3. Compute loss if we drop each token from vocab
# 4. Drop the bottom 10% (lowest contribution to likelihood)
# 5. Re-run EM, repeat until vocab size = target

# At inference, tokenize via Viterbi over the unigram LM:
#   "lowest" could be ["low", "est"] or ["lo", "west"] or ["l", "owest"]
#   Pick the path with max product of token probabilities.

Unigram is not greedy — it considers all segmentations and picks the most likely. This produces shorter, more linguistically reasonable tokens, especially for non-English. It also enables subword regularization (sample alternative segmentations during training) for free.

SentencePiece treats whitespace as a special character (), so "Hello world" becomes "▁Hello▁world" and there's no separate pre-tokenization step. Llama-1, Llama-2, T5, mBART all use SentencePiece+Unigram.

5. tiktoken: The Production Tokenizer

OpenAI's open-source BPE implementation in Rust, with Python bindings. Fast, deterministic, identical to what runs in production.

import tiktoken

# OpenAI uses different encodings per model family
enc = tiktoken.encoding_for_model("gpt-4o")     # o200k_base
ids = enc.encode("Hello, world!")
# [13225, 11, 2375, 0]
text = enc.decode(ids)
# "Hello, world!"

# Available encodings:
# - cl100k_base    GPT-3.5, GPT-4 (100,256 tokens)
# - o200k_base     GPT-4o, o1 (200,019 tokens)
# - p50k_base      Codex, text-davinci (50,281 tokens)
# - r50k_base      GPT-3 (50,257 tokens)

The Rust core uses a precomputed merge table and a regex split (via fancy-regex) for pre-tokenization. The pre-tokenizer regex for cl100k_base is famous — it splits on contractions, numbers, whitespace runs, and punctuation in a specific order that took empirical tuning.

6. Vocabulary Size Tradeoffs

Vocab sizeEmbedding paramsAvg tokens/wordTradeoff
10k40M (4096d)~1.6Compact but shatters rare words
32k (Llama-2)131M~1.3Good for English, weak on CJK
100k (GPT-4)410M~1.15Better multilingual; bigger softmax
200k (GPT-4o)820M~1.05Strong multilingual + code; heavy embedding layer

The embedding matrix (and tied output head) scale linearly with vocab size. For a 4096-dim model, a 200k vocab burns ~820M params on the embedding alone. But the savings in token count (~10% shorter sequences) pay for it during inference, since the per-token cost dwarfs the one-time embedding lookup.

7. Why "Tokenizers Are the New Achilles' Heel"

A long list of LLM bugs trace to tokenization:

  • Reverse-string fails: "lollipop" reversed → wrong, because the model sees three tokens, not eight characters.
  • Counting letters fails: "How many R's in strawberry?" → 2, because "berry" is one token.
  • Glitch tokens: SolidGoldMagikarp existed in the BPE vocab but was almost absent from training data; the model produced unhinged output when prompted with it.
  • Code indentation issues: a single space and four spaces are different tokens, so the model has to learn each indentation level separately.
  • Multilingual fairness: the same paragraph in Burmese costs ~8× the tokens of English. Same API price, less context, slower output.
  • Cross-tokenizer training: embeddings don't transfer between vocabularies, so distillation requires re-training the student tokenizer.

Andrej Karpathy's "Let's build the GPT Tokenizer" video crystallized the term — tokenization is the boring layer that breaks in surprising ways.

Tradeoffs

AlgorithmStrengthsWeaknesses
BPE (byte-level)No OOV, fast, simple, dominates LLM spaceGreedy merges are suboptimal; whitespace bugs
WordPiecePMI-based, BERT-era provenNeeds pre-tokenizer; ## prefix complexity
Unigram (SentencePiece)Globally optimal; subword regularization; multilingualSlower training; harder to interpret merges
Character / byte-onlyNo vocab issues at all10× longer sequences → 100× attention cost

FAQ

Why is "strawberry" tokenized weirdly?

tiktoken encodes "strawberry" as ["str", "aw", "berry"] — three tokens. The model has no character-level access; it learns letter-counting only via spelling-out training data. That's why frontier models often miscount letters in single tokens.

Can I extend a tokenizer's vocabulary?

Yes — append new tokens and resize the embedding matrix. But the new embeddings start untrained, so you need finetuning to make them useful. Llama-3's vocab jump from 32k to 128k required full retraining.

Why doesn't anyone use character-level tokenization?

Sequence length explodes 4–6×. Attention is O(n²), so you'd pay 16–36× the compute. Even ByT5 (Google's byte-level T5) needed deeper layers to compensate, ending up slower at the same quality.

How does tiktoken differ from HuggingFace tokenizers?

tiktoken is purely BPE in Rust, ~10× faster than HF's pure-Python fallback. HF's tokenizers library (also Rust) supports more algorithms (WordPiece, Unigram, BPE) and the full normalize → pre-tokenize → encode pipeline as composable building blocks.

What's a "glitch token"?

A token that ended up in the vocabulary because some weird scraper artifact appeared frequently in the BPE training corpus, but didn't make it into the LLM training corpus (or appeared rarely). The token's embedding is essentially random — feeding it produces unhinged output.