Design Search Autocomplete
Autocomplete is a deceptively simple feature with brutal latency requirements: every keystroke must round-trip to the server and render in < 100 ms p99. The system must handle billions of prefix lookups per day, return relevance-ranked results that include personalization, tolerate typos, and update its corpus as the world changes (new movies, news terms, products). The architecture is a trie + ranking + debouncing trifecta with caching at every layer.
Architecture
Capacity Estimation
| Metric | Value | Notes |
|---|---|---|
| Queries/s peak | ~500 K | Google-scale autocomplete |
| p99 latency | < 100 ms | round-trip + render |
| Trie nodes | ~50 M | 10 M unique queries × avg 5 chars |
| Trie memory | ~5 GB | per shard, with top-K cached |
| Logs ingested | ~1 B/day | search submissions |
| Index refresh | 1 h batch + 1 m streaming | hot terms |
| Cache hit rate | ~70% | at edge |
Trie / Prefix Tree
The data structure: each node represents a prefix; child pointers follow the next character. At each node, store the top-K completions with their pre-computed scores. On query, walk the trie to the prefix node and return its cached top-K.
- Memory: ~10–40 bytes per node, plus K × (string + score) at each. Compressed tries (radix tree, DAWG) collapse single-child chains.
- Build: aggregate query logs into
(query, frequency); insert into trie; compute and cache top-K per node bottom-up. - Update: hourly batch rebuild for the bulk; in-memory delta trie for the latest hot terms; merge at query time.
For massive corpora, an FST (Finite State Transducer) is more memory-efficient than a trie. Lucene uses FSTs internally for autocomplete.
Ranking
The score for a completion is a weighted sum:
score = w1 · popularity + w2 · recency + w3 · personalization + w4 · commerce_signal
- Popularity — query frequency over the trailing 30 days. Log-scale to compress the long tail.
- Recency — recent query bias for trending terms. Decaying counter (EWMA) per query.
- Personalization — user's own past queries, location, language. Applied as a re-rank on top of the global top-K rather than a separate trie per user.
- Commerce (e-commerce autocomplete) — bias toward in-stock high-margin SKUs.
Personalization is a re-rank, not a re-fetch: get top-50 from the global trie, blend with user history, return top-10.
Client-side Debouncing
Without debouncing, every keystroke sends a request: a 10-character query produces 10 requests, 9 wasted. Strategies:
- Time debounce — wait 100–200 ms after the last keystroke before firing. Cancels in-flight requests for older prefixes.
- Pre-fetch on focus — fetch the top-K for the empty prefix when the box is focused; show as the "recent" / "trending" menu.
- Local cache — keep results for prefixes the user just typed; backspace returns instantly.
- Race-cancel — a new keystroke aborts any pending request; only the freshest result renders.
Debouncing is the single biggest server-load reduction. 80% of would-be requests never leave the browser.
Edit Distance and Typo Tolerance
Users mistype. "itlay" should still suggest "italy". Approaches:
- Levenshtein automaton — build an NFA that accepts strings within K edits of the prefix; intersect with the trie / FST. O(prefix × edit_distance) per node visited; standard in Lucene.
- Phonetic encoding (Soundex, Metaphone) — index by phonetic key; "Knight" and "Nite" share the same phonetic. Catches sound-alike misspellings.
- Symspell — precompute deletes for every dictionary word; lookup the misspelled query against the deletes table. Memory-heavy but blazingly fast.
Tier the response: exact-match completions first, then within-1-edit, then within-2-edits. Stop at the first tier that yields enough results.
The 100 ms p99 Budget
Where the time goes:
- Network round-trip: 30–60 ms (geography). Mitigation: anycast + edge nodes.
- Edge cache: 1–3 ms on hit; if hit, you're done.
- Origin trie lookup: 1–5 ms in-memory; the trie is hot.
- Personalization re-rank: 5–20 ms; usually the longest origin step.
- Render: 5–20 ms in the browser.
Cache aggressively: prefix-only results at the CDN (no personalization needed when prefix is short and high-frequency). For personalized results, cache per-user for the session.
Index Pipeline
Where new completions come from:
- Search submission logs — the dominant signal. Aggregate
(query, count, last_seen); filter spam (per-IP rate, bot detection); filter sensitive terms (NSFW, PII). - Catalog (e-commerce) — product titles + categories.
- Editorial — manual whitelists for new terms (a movie launch, a news event).
Pipeline: Kafka → Spark / Flink aggregation → trie builder → deploy via blue-green swap. The hot-term streaming path skips the batch and merges into the in-memory delta trie within a minute.
Failure Modes
- Spammy queries pollute the trie — bot-driven query farms make nonsense terms top suggestions. Filter at ingest by per-IP / per-account rate; require > N distinct submitters before a term is eligible.
- Stale trie after rebuild bug — pipeline fails silently; suggestions go cold. Alert on rebuild age.
- Personalization leak — user A's recent searches appear in user B's autocomplete due to caching key mistake. Always vary on user_id; never cache personalized results across users.
- Edge cache stampede — viral prefix invalidates; everyone misses simultaneously. Soft-TTL + background refresh.
FAQ
Trie or Elasticsearch?
For pure-prefix autocomplete on a fixed corpus, in-memory trie wins on latency. Elasticsearch's edge-ngram analyzer or completion suggester works for general search; ~10–30 ms latency.
How do you handle CJK languages?
Unicode-correct tokenization. Chinese / Japanese / Korean prefix is per-character; segment large queries at word boundaries (jieba, MeCab) for ranking.
What about query-time A/B testing?
Variant assigned at the gateway; encoded in the request to autocomplete service; different ranking weights or different feature flags. Track CTR per variant downstream.
Should I show search history?
Yes for logged-in users (privacy controls required); no for anonymous (creepy). Recent history outranks general popularity for the user.