Design Search Autocomplete — trie, ranking, debouncing, typo tolerance

Architecture

Capacity Estimation

Metric	Value	Notes
Queries/s peak	~500 K	Google-scale autocomplete
p99 latency	< 100 ms	round-trip + render
Trie nodes	~50 M	10 M unique queries × avg 5 chars
Trie memory	~5 GB	per shard, with top-K cached
Logs ingested	~1 B/day	search submissions
Index refresh	1 h batch + 1 m streaming	hot terms
Cache hit rate	~70%	at edge

Trie / Prefix Tree

The data structure: each node represents a prefix; child pointers follow the next character. At each node, store the top-K completions with their pre-computed scores. On query, walk the trie to the prefix node and return its cached top-K.

Memory: ~10–40 bytes per node, plus K × (string + score) at each. Compressed tries (radix tree, DAWG) collapse single-child chains.
Build: aggregate query logs into (query, frequency); insert into trie; compute and cache top-K per node bottom-up.
Update: hourly batch rebuild for the bulk; in-memory delta trie for the latest hot terms; merge at query time.

For massive corpora, an FST (Finite State Transducer) is more memory-efficient than a trie. Lucene uses FSTs internally for autocomplete.

Ranking

The score for a completion is a weighted sum:

score = w1 · popularity + w2 · recency + w3 · personalization + w4 · commerce_signal

Popularity — query frequency over the trailing 30 days. Log-scale to compress the long tail.
Recency — recent query bias for trending terms. Decaying counter (EWMA) per query.
Personalization — user's own past queries, location, language. Applied as a re-rank on top of the global top-K rather than a separate trie per user.
Commerce (e-commerce autocomplete) — bias toward in-stock high-margin SKUs.

Personalization is a re-rank, not a re-fetch: get top-50 from the global trie, blend with user history, return top-10.

Client-side Debouncing

Without debouncing, every keystroke sends a request: a 10-character query produces 10 requests, 9 wasted. Strategies:

Time debounce — wait 100–200 ms after the last keystroke before firing. Cancels in-flight requests for older prefixes.
Pre-fetch on focus — fetch the top-K for the empty prefix when the box is focused; show as the "recent" / "trending" menu.
Local cache — keep results for prefixes the user just typed; backspace returns instantly.
Race-cancel — a new keystroke aborts any pending request; only the freshest result renders.

Debouncing is the single biggest server-load reduction. 80% of would-be requests never leave the browser.

Edit Distance and Typo Tolerance

Users mistype. "itlay" should still suggest "italy". Approaches:

Levenshtein automaton — build an NFA that accepts strings within K edits of the prefix; intersect with the trie / FST. O(prefix × edit_distance) per node visited; standard in Lucene.
Phonetic encoding (Soundex, Metaphone) — index by phonetic key; "Knight" and "Nite" share the same phonetic. Catches sound-alike misspellings.
Symspell — precompute deletes for every dictionary word; lookup the misspelled query against the deletes table. Memory-heavy but blazingly fast.

Tier the response: exact-match completions first, then within-1-edit, then within-2-edits. Stop at the first tier that yields enough results.

The 100 ms p99 Budget

Where the time goes:

Network round-trip: 30–60 ms (geography). Mitigation: anycast + edge nodes.
Edge cache: 1–3 ms on hit; if hit, you're done.
Origin trie lookup: 1–5 ms in-memory; the trie is hot.
Personalization re-rank: 5–20 ms; usually the longest origin step.
Render: 5–20 ms in the browser.

Cache aggressively: prefix-only results at the CDN (no personalization needed when prefix is short and high-frequency). For personalized results, cache per-user for the session.

Index Pipeline

Where new completions come from:

Search submission logs — the dominant signal. Aggregate (query, count, last_seen); filter spam (per-IP rate, bot detection); filter sensitive terms (NSFW, PII).
Catalog (e-commerce) — product titles + categories.
Editorial — manual whitelists for new terms (a movie launch, a news event).

Pipeline: Kafka → Spark / Flink aggregation → trie builder → deploy via blue-green swap. The hot-term streaming path skips the batch and merges into the in-memory delta trie within a minute.

Failure Modes

Spammy queries pollute the trie — bot-driven query farms make nonsense terms top suggestions. Filter at ingest by per-IP / per-account rate; require > N distinct submitters before a term is eligible.
Stale trie after rebuild bug — pipeline fails silently; suggestions go cold. Alert on rebuild age.
Personalization leak — user A's recent searches appear in user B's autocomplete due to caching key mistake. Always vary on user_id; never cache personalized results across users.
Edge cache stampede — viral prefix invalidates; everyone misses simultaneously. Soft-TTL + background refresh.

FAQ

Trie or Elasticsearch?

For pure-prefix autocomplete on a fixed corpus, in-memory trie wins on latency. Elasticsearch's edge-ngram analyzer or completion suggester works for general search; ~10–30 ms latency.

How do you handle CJK languages?

Unicode-correct tokenization. Chinese / Japanese / Korean prefix is per-character; segment large queries at word boundaries (jieba, MeCab) for ranking.

What about query-time A/B testing?

Variant assigned at the gateway; encoded in the request to autocomplete service; different ranking weights or different feature flags. Track CTR per variant downstream.

Should I show search history?

Yes for logged-in users (privacy controls required); no for anonymous (creepy). Recent history outranks general popularity for the user.