Design a News Feed
Fanout, Ranking, and the Celebrity Problem at a Billion-User Scale
The News Feed problem is the canonical "social-graph at scale" interview question. The naive approach — "on read, query every post by every user I follow, sort by time" — works for tens of users and falls over by the millionth. The right answer is not a single algorithm but a constellation of trade-offs: fanout-on-write vs fanout-on-read, the celebrity-account special case, a multi-stage ranking pipeline that turns candidate posts into a personalized timeline, a hot-feed cache in Redis, and async media delivery via CDN. Twitter, Facebook, Instagram, and TikTok all solve this problem — and all settled on similar shapes for similar reasons.
Why This Problem Is Hard
Users open the app 10–30 times a day; they post once. That's a 100:1 to 300:1 read-to-write ratio. Optimizing the write path for cheapness while making the read path fast is the central architectural lever. Fanout-on-write trades expensive writes for cheap reads.
Follower counts follow a power law. A normal user has ~200 followers; a small celebrity has 500K; the largest accounts have 500M. A naive fanout-on-write where every post is copied to every follower's inbox is fine at 200 followers, catastrophic at 500M. The "celebrity problem" is the reason every real news-feed system is hybrid.
Modern feeds are not chronological. A post's score depends on dozens of features: recency, engagement, author affinity, content embeddings, dwell-time predictions. Ranking isn't a sidecar; it's the hot path. The architecture has to deliver a candidate set to the ranker in tens of milliseconds, or the user sees a blank screen.
High-Level Architecture
Key Numbers
Capacity Estimation
The arithmetic that motivates every architectural choice:
# Inputs (Twitter-scale)
DAU = 250,000,000
posts per user/day = 2 (avg, very skewed)
followers per user = 200 (avg, very skewed)
feed opens / day = 10
posts per feed = 100 (cached candidate set)
# Writes
posts/sec = 250M * 2 / 86400 = ~5,800 posts/sec average
peak = 5,800 * 3 = ~17,000 peak QPS
# Pure fanout-on-write
fanout writes/day = 250M * 2 * 200 = 100B/day
fanout writes/sec = 100B / 86400 = 1.16M writes/sec average
peak = ~3.5M writes/sec
storage (post_id 8B + score 8B) = 16 bytes/entry
daily fanout bytes = 100B * 16 = 1.6 TB/day
# But 0.01% of users have 500M followers each
1000 celebrity posts/day * 500M followers = 500B writes/day
ALONE -- 5x the rest of the system put together.
# Reads
feed_opens/sec = 250M * 10 / 86400 = ~29,000 opens/sec avg
peak = ~87,000 opens/sec
posts hydrated/sec at peak = 87,000 * 100 = 8.7M posts/sec
# Storage
post body avg 280 chars + media URLs ~ 1 KB
total posts/day = 500M
posts storage/day = 500GB
posts storage/year = ~180 TB (just text + metadata)
plus media (S3, separate) = ~4 PB/day at FB scale The numbers force three decisions: (1) you cannot fanout celebrity posts; (2) the inbox cache must be in RAM (Redis), not on disk; (3) you must shed load on the read path with aggressive caching.
Push (Fanout-on-Write) vs Pull (Fanout-on-Read)
| Push / Fanout-on-Write | Pull / Fanout-on-Read | |
|---|---|---|
| When post arrives | Iterate followers, write post_id into each follower's inbox | Just write to author's outbox |
| When feed is read | Read inbox ZSET (one query) | For each followee, read their outbox; merge top-N |
| Read latency | 5–20 ms (single Redis hit) | 50–500 ms (N queries + merge) |
| Write amplification | O(followers) | 1 |
| Storage | O(followers × posts) | O(posts) |
| Best for | Normal users, hot feeds | Celebrity authors, inactive users |
| Failure mode | Celebrity post = millions of writes | Active user with 5K followees = 5K queries |
The classic Twitter answer — documented in their 2013 blog "The Infrastructure Behind Twitter" — is hybrid:
- For users with < ~10,000 followers: fanout on write. The post is pushed into each follower's Redis-backed home timeline (sorted set keyed by tweet ID).
- For users with ≥ ~10,000 followers ("celebrities"): fanout on read. The post is stored only in the author's user timeline. When a follower opens their feed, the system reads the home timeline from Redis and queries the user timelines of every celebrity they follow, merging the results.
The threshold is tuned per platform — Facebook used a higher one historically (~100K) because Facebook posts are richer (less frequent, more expensive to merge). Some implementations don't use a hard cutoff but a soft one: every post above N followers triggers fanout to active followers (logged in within X days) only, leaving cold users to be served via pull on demand.
The Inbox: Redis Sorted Sets
The home timeline cache is a Redis sorted set per user. Each member is a post_id; the score is a timestamp (or, for ranked feeds, a precomputed score). Operations:
# Push to follower inbox (called by fanout worker)
ZADD home:{user_id} <timestamp> <post_id>
ZREMRANGEBYRANK home:{user_id} 0 -801 # keep only 800 most recent
# Read top-N for feed open
ZREVRANGE home:{user_id} 0 99 WITHSCORES
# Per-user cap is critical: without ZREMRANGEBYRANK,
# memory blows up. 800 entries x 16 bytes = 12.8 KB / user
# 250M users x 12.8 KB = 3.2 TB cluster-wide
# Sharding: hash(user_id) -> Redis node
# Shard locality: keep all of user's data co-located
# Replication: 1 primary + 2 replicas per shard
# Eviction policy: allkeys-lru
# Cold inactive users get evicted; their feed is rebuilt
# from posts table on next open (acceptable -- rare event) Twitter's original "Timeline Service" used a custom in-memory store (an evolved memcached called TFE) for exactly this pattern; modern designs typically use Redis Cluster or a managed equivalent like ElastiCache. The cap (Twitter caps at ~800 entries per home timeline) means a user who hasn't opened the app in days only sees the most recent 800 posts they would have, not the full historical backfill — a deliberate product choice.
The Celebrity Path: Merge at Read Time
# Pseudo-code for hybrid feed assembly
def get_home_feed(user_id, limit=100):
# 1. Read pushed posts (fanout-on-write inbox)
pushed = redis.zrevrange(f"home:{user_id}", 0, limit*2-1, withscores=True)
# 2. List celebrity authors this user follows
celebs = graph.celebrities_followed_by(user_id) # cached, max ~50
# 3. Read each celebrity's outbox in parallel
celeb_posts = []
with ThreadPoolExecutor(max_workers=20) as pool:
for celeb_id in celebs:
celeb_posts.extend(pool.submit(redis.zrevrange,
f"user:{celeb_id}:posts", 0, 50, withscores=True).result())
# 4. Merge by score (timestamp or ranked score)
candidates = heapq.merge(pushed, celeb_posts, key=lambda x: -x[1])
# 5. Take top-K, send to ranker
return list(itertools.islice(candidates, 500))
# Total latency budget:
# Step 1: 5 ms (one Redis hit)
# Step 2: 1 ms (cached graph lookup)
# Step 3: 20 ms (N parallel Redis hits, bounded by max followees-of-celebs)
# Step 4: 5 ms
# Step 5: 5 ms
# = ~36 ms before ranking The number of celebrities any user follows is bounded by social reality (people don't follow more than ~50 mega-accounts), which makes the merge cost predictable. Without that bound, the celebrity path would itself become unbounded — which is why some platforms also restrict who can become a "celebrity" account (verified, public figure designations) or pre-aggregate celebrity posts into a global hot-list.
The Ranking Pipeline
Modern feeds run a multi-stage funnel that turns thousands of candidates into ~30 ranked posts:
Pull from many sources: hybrid inbox (above), trending posts in user's location/topic, collaborative-filtering candidates ("users like you also liked"), explicit interest-graph matches, sponsored content. Each source contributes a few hundred candidates. Latency: ~50 ms.
A cheap model (gradient-boosted trees on a few hundred features) scores all candidates and keeps the top 100. Features: post age, author affinity, content type, basic engagement counts. Throughput: ~10K candidates/ms. This filter is what lets the next stage afford to run.
A neural model with thousands of features per (user, post) pair. Predicts probability of multiple actions: like, comment, share, dwell-time, hide. Combines them with platform-specific objective weights. Latency: ~30–50 ms for 100 candidates batched in one inference.
Apply business logic: don't show 5 posts from the same author in a row, mix in ads at fixed cadence, demote near-duplicates, enforce time-since-last-seen. Returns final ordered list. ~5 ms.
Twitter open-sourced parts of "the algorithm" in 2023 — their ranking pipeline matches this shape almost exactly: a SimClusters-based candidate generation, a "light ranker" using a real-time feature store, and a "heavy ranker" neural model. Facebook's analogous pipeline was described in their 2017 paper "DLRM: Deep Learning Recommendation Model."
Database Choices
| Data | Store | Why |
|---|---|---|
| Posts (raw content) | Cassandra / Manhattan / TAO | Write-heavy, append-mostly, partitioned by author |
| Follow graph | FlockDB / sharded MySQL / TAO | Read-heavy, billions of edges, simple lookups |
| Home timelines | Redis Cluster | In-RAM ZSET; bounded per user; sub-ms reads |
| User profiles | MySQL / Postgres sharded | Strong consistency, low QPS per user |
| Media files | S3 / Haystack / Manifold | Object store; immutable; CDN-cacheable |
| Search | Elasticsearch / Manhattan | Inverted index, fuzzy matching |
| Analytics events | Kafka → HDFS / S3 → Spark | Append-only at firehose scale |
| ML feature store | Feast / custom (e.g. FBLearner Flow) | Online + offline parity; sub-10ms reads |
Cassandra is the canonical choice for posts because the data shape fits its strengths exactly: partitioned by user_id (one shard per author), clustered by timestamp descending, writes never update existing rows (just append a new post). Reads are always "latest N posts by user X," which becomes a single-shard, single-partition seek. Twitter built Manhattan as a Cassandra-style store with operational improvements; Facebook built TAO as a graph-aware cache layer over MySQL.
Twitter's Home Timeline, Specifically
Twitter's blog and engineering talks are unusually open about their architecture. The shape, as of 2023:
- Tweet creation → written to Manhattan (their custom key-value store, conceptually similar to Cassandra). User timeline (the author's own posts) is materialized.
- Tweet event → Kafka → fanout service.
- Fanout queries the social graph (followers of author). For each follower with < 10K followees and recently active, push the tweet ID into their Timeline Service (TLS) home timeline cache — an in-memory store derived from Redis-style ZSETs.
- Heavy hitters (accounts with millions of followers) skip fanout. Their tweets are pulled at read-time.
- On feed open, the Home Mixer service: reads the TLS home timeline, fetches candidate tweets from celebrity outboxes, queries other candidate sources (Out-of-Network, e.g. SimClusters-recommended tweets from accounts you don't follow), batches them through the light ranker, then the heavy ranker, applies diversification, returns ~30 tweets.
- Hydration happens last: take the final 30 tweet IDs, pull tweet bodies and media URLs in parallel, pull engagement counts (likes/retweets) from a separate counters service, return to the client.
Tradeoffs & Failure Modes
- Fanout backlog under bursts. A celebrity who suddenly gains followers can spike fanout queue depth. Solution: rate-limit per-author fanout, prioritize active follower delivery, accept eventual consistency for inactive users.
- Hot author shards. A celebrity's user_timeline is read by millions on every feed open. Solution: replicate hot-author timelines to many cache nodes, pre-warm during traffic spikes (live events).
- Inbox cache eviction storms. If a Redis node fails and its replicas can't keep up, millions of users see empty feeds while the cache rebuilds. Solution: use replicated sorted-sets, slow-rebuild from Cassandra in background.
- Ranking model drift. A trained model is good for a few days; user behavior shifts. Continuous online training pipelines (Spark/Flink) are mandatory; A/B harnesses to evaluate new models before rollout are a separate problem of comparable size.
- Privacy / blocking. A user blocks another. The blocked user's posts must not appear in feed. Either filter at fanout time (small inboxes contaminate) or filter at hydration time (read-time cost). Most platforms filter at hydration since blocks are rare and the post-blacklist join is cheap.
- Edited / deleted posts. Posts in inboxes become stale. Either store only IDs and re-fetch at read time (always fresh, more reads), or push edits/deletes through Kafka to invalidate (more writes, lower read amplification). Most large platforms do the former.
- Mute / unfollow churn. Frequent follow/unfollow doesn't backfill or remove the inbox; future posts are correctly filtered, but historical posts in the cache are not. Acceptable to most products.
- Cold-start ranking. A brand-new user has no engagement history; the heavy ranker has nothing to score from. Fall back to popularity-based defaults until ~10 interactions accumulate, then transition to personalized model.
FAQ
Why not just store everything in a relational database with secondary indexes?
Math. 500B fanout writes/day at peak ~3M writes/sec exceeds the per-shard write capacity of any relational system without sharding so aggressively that you've reinvented Cassandra. Reads on the home timeline must be O(1) per user; that means the index is the data. Redis ZSET fits the shape exactly.
What's the per-user RAM cost of the home timeline cache?
~12–15 KB if you cap at 800 entries (post_id + score = 16 bytes, plus ZSET overhead). At 250M DAU that's ~3.5 TB of cache — expensive but tractable across a Redis cluster of ~150 nodes with 24 GB each. Inactive users get evicted via LRU; their cache is rebuilt on next login from Cassandra at the cost of one slow feed open (~1–2 s).
How is "ranking" different from "sorting by recency"?
Recency is a single feature. A ranked feed scores each candidate post against many features: author affinity (do you usually engage with this person), content embedding similarity to your history, predicted dwell time, predicted action probabilities (like / comment / share). The model is trained on logged user behavior to predict downstream engagement. Recency is one input of the model, not the output.
Why not let the client merge multiple followee outboxes? Push the work to mobile.
Network round trips. A mobile client doing N parallel fetches over LTE pays per-request latency on each one. Server-side merging happens with sub-ms inter-node RPCs in the same datacenter. Also, ranking needs cross-candidate features (e.g., "this is the 5th post from this author this session") that the client can't compute without seeing all candidates.
What happens when a celebrity's tweet goes viral — how does the system handle the read amplification?
The tweet's user_timeline shard is read by millions. Solutions: (1) replicate hot tweets to many cache nodes (consistent hashing with virtual nodes lets you spread one author across many shards for hot data); (2) push the tweet through a separate "trending" cache fronted by the CDN with aggressive TTLs; (3) at the application layer, dedupe identical viral tweets across feeds and serve from a single edge cache.
How fresh is the feed in practice? Is it real-time?
Eventual consistency, typically <5 seconds end-to-end (post created → visible in followers' inboxes). For 99.9% of follower pairs that's fine. The exception is "I just posted — why don't I see my own post yet?" — solved by writing the post to your own home timeline synchronously on creation, before the async fanout job runs.
Could you build a chronological-only feed and skip ranking entirely?
You can — that's how Twitter's "Latest" tab and Mastodon work. Engineering is far simpler (no model serving, no online features). Engagement is dramatically lower. Modern social products all use ranked feeds because ranked feeds keep users in the app longer; chronological feeds are a setting, not the default.
How do you A/B test ranking changes without breaking everyone's feed?
Run multiple ranker variants in parallel, hash users to variants, assign each user to one ranker deterministically. The Home Mixer routes user_id -> variant_id, calls that variant's model, logs interactions tagged with variant. Online metrics (CTR, dwell time, retention) are compared across variants. Twitter and Meta both run hundreds of A/B tests on the ranker simultaneously.