Design a News Feed

Fanout, Ranking, and the Celebrity Problem at a Billion-User Scale

The News Feed problem is the canonical "social-graph at scale" interview question. The naive approach — "on read, query every post by every user I follow, sort by time" — works for tens of users and falls over by the millionth. The right answer is not a single algorithm but a constellation of trade-offs: fanout-on-write vs fanout-on-read, the celebrity-account special case, a multi-stage ranking pipeline that turns candidate posts into a personalized timeline, a hot-feed cache in Redis, and async media delivery via CDN. Twitter, Facebook, Instagram, and TikTok all solve this problem — and all settled on similar shapes for similar reasons.

Why This Problem Is Hard

Read amplification

Users open the app 10–30 times a day; they post once. That's a 100:1 to 300:1 read-to-write ratio. Optimizing the write path for cheapness while making the read path fast is the central architectural lever. Fanout-on-write trades expensive writes for cheap reads.

Skewed graph

Follower counts follow a power law. A normal user has ~200 followers; a small celebrity has 500K; the largest accounts have 500M. A naive fanout-on-write where every post is copied to every follower's inbox is fine at 200 followers, catastrophic at 500M. The "celebrity problem" is the reason every real news-feed system is hybrid.

Ranking is the product

Modern feeds are not chronological. A post's score depends on dozens of features: recency, engagement, author affinity, content embeddings, dwell-time predictions. Ranking isn't a sidecar; it's the hot path. The architecture has to deliver a candidate set to the ranker in tens of milliseconds, or the user sees a blank screen.

High-Level Architecture

Key Numbers

DAU (Twitter ~ 2024)

~250M

Posts / day

~500M

Avg followers

~200

Top accounts

100M–500M followers

Read:Write ratio

~100:1

Feed open p95 target

200 ms

Ranker latency budget

50 ms

Storage growth (FB ~)

~4 PB / day

Capacity Estimation

The arithmetic that motivates every architectural choice:

# Inputs (Twitter-scale)
DAU                = 250,000,000
posts per user/day = 2          (avg, very skewed)
followers per user = 200        (avg, very skewed)
feed opens / day   = 10
posts per feed     = 100        (cached candidate set)

# Writes
posts/sec  = 250M * 2 / 86400  =  ~5,800 posts/sec average
peak       = 5,800 * 3         = ~17,000 peak QPS

# Pure fanout-on-write
fanout writes/day = 250M * 2 * 200       = 100B/day
fanout writes/sec = 100B / 86400         = 1.16M writes/sec average
peak                                     = ~3.5M writes/sec
storage (post_id 8B + score 8B)         = 16 bytes/entry
daily fanout bytes                      = 100B * 16 = 1.6 TB/day

# But 0.01% of users have 500M followers each
1000 celebrity posts/day * 500M followers = 500B writes/day
   ALONE -- 5x the rest of the system put together.

# Reads
feed_opens/sec = 250M * 10 / 86400 = ~29,000 opens/sec avg
peak                               = ~87,000 opens/sec
posts hydrated/sec at peak         = 87,000 * 100 = 8.7M posts/sec

# Storage
post body avg 280 chars + media URLs ~ 1 KB
total posts/day            = 500M
posts storage/day          = 500GB
posts storage/year         = ~180 TB (just text + metadata)
plus media (S3, separate)  = ~4 PB/day at FB scale

The numbers force three decisions: (1) you cannot fanout celebrity posts; (2) the inbox cache must be in RAM (Redis), not on disk; (3) you must shed load on the read path with aggressive caching.

Push (Fanout-on-Write) vs Pull (Fanout-on-Read)

	Push / Fanout-on-Write	Pull / Fanout-on-Read
When post arrives	Iterate followers, write post_id into each follower's inbox	Just write to author's outbox
When feed is read	Read inbox ZSET (one query)	For each followee, read their outbox; merge top-N
Read latency	5–20 ms (single Redis hit)	50–500 ms (N queries + merge)
Write amplification	O(followers)	1
Storage	O(followers × posts)	O(posts)
Best for	Normal users, hot feeds	Celebrity authors, inactive users
Failure mode	Celebrity post = millions of writes	Active user with 5K followees = 5K queries

The classic Twitter answer — documented in their 2013 blog "The Infrastructure Behind Twitter" — is hybrid:

For users with < ~10,000 followers: fanout on write. The post is pushed into each follower's Redis-backed home timeline (sorted set keyed by tweet ID).
For users with ≥ ~10,000 followers ("celebrities"): fanout on read. The post is stored only in the author's user timeline. When a follower opens their feed, the system reads the home timeline from Redis and queries the user timelines of every celebrity they follow, merging the results.

The threshold is tuned per platform — Facebook used a higher one historically (~100K) because Facebook posts are richer (less frequent, more expensive to merge). Some implementations don't use a hard cutoff but a soft one: every post above N followers triggers fanout to active followers (logged in within X days) only, leaving cold users to be served via pull on demand.

The Inbox: Redis Sorted Sets

The home timeline cache is a Redis sorted set per user. Each member is a post_id; the score is a timestamp (or, for ranked feeds, a precomputed score). Operations:

# Push to follower inbox (called by fanout worker)
ZADD home:{user_id} <timestamp> <post_id>
ZREMRANGEBYRANK home:{user_id} 0 -801   # keep only 800 most recent

# Read top-N for feed open
ZREVRANGE home:{user_id} 0 99 WITHSCORES

# Per-user cap is critical: without ZREMRANGEBYRANK,
# memory blows up. 800 entries x 16 bytes = 12.8 KB / user
# 250M users x 12.8 KB = 3.2 TB cluster-wide

# Sharding: hash(user_id) -> Redis node
# Shard locality: keep all of user's data co-located
# Replication: 1 primary + 2 replicas per shard

# Eviction policy: allkeys-lru
# Cold inactive users get evicted; their feed is rebuilt
# from posts table on next open (acceptable -- rare event)

Twitter's original "Timeline Service" used a custom in-memory store (an evolved memcached called TFE) for exactly this pattern; modern designs typically use Redis Cluster or a managed equivalent like ElastiCache. The cap (Twitter caps at ~800 entries per home timeline) means a user who hasn't opened the app in days only sees the most recent 800 posts they would have, not the full historical backfill — a deliberate product choice.

The Celebrity Path: Merge at Read Time

# Pseudo-code for hybrid feed assembly
def get_home_feed(user_id, limit=100):
    # 1. Read pushed posts (fanout-on-write inbox)
    pushed = redis.zrevrange(f"home:{user_id}", 0, limit*2-1, withscores=True)

    # 2. List celebrity authors this user follows
    celebs = graph.celebrities_followed_by(user_id)  # cached, max ~50

    # 3. Read each celebrity's outbox in parallel
    celeb_posts = []
    with ThreadPoolExecutor(max_workers=20) as pool:
        for celeb_id in celebs:
            celeb_posts.extend(pool.submit(redis.zrevrange,
                f"user:{celeb_id}:posts", 0, 50, withscores=True).result())

    # 4. Merge by score (timestamp or ranked score)
    candidates = heapq.merge(pushed, celeb_posts, key=lambda x: -x[1])

    # 5. Take top-K, send to ranker
    return list(itertools.islice(candidates, 500))

# Total latency budget:
#   Step 1: 5 ms (one Redis hit)
#   Step 2: 1 ms (cached graph lookup)
#   Step 3: 20 ms (N parallel Redis hits, bounded by max followees-of-celebs)
#   Step 4: 5 ms
#   Step 5: 5 ms
#   = ~36 ms before ranking

The number of celebrities any user follows is bounded by social reality (people don't follow more than ~50 mega-accounts), which makes the merge cost predictable. Without that bound, the celebrity path would itself become unbounded — which is why some platforms also restrict who can become a "celebrity" account (verified, public figure designations) or pre-aggregate celebrity posts into a global hot-list.

The Ranking Pipeline

Modern feeds run a multi-stage funnel that turns thousands of candidates into ~30 ranked posts:

Candidate Generation (~thousands → ~500)

Pull from many sources: hybrid inbox (above), trending posts in user's location/topic, collaborative-filtering candidates ("users like you also liked"), explicit interest-graph matches, sponsored content. Each source contributes a few hundred candidates. Latency: ~50 ms.

Light Ranker (500 → 100)

A cheap model (gradient-boosted trees on a few hundred features) scores all candidates and keeps the top 100. Features: post age, author affinity, content type, basic engagement counts. Throughput: ~10K candidates/ms. This filter is what lets the next stage afford to run.

Heavy Ranker (100 → ~30)

A neural model with thousands of features per (user, post) pair. Predicts probability of multiple actions: like, comment, share, dwell-time, hide. Combines them with platform-specific objective weights. Latency: ~30–50 ms for 100 candidates batched in one inference.

Re-ranker / Diversifier

Apply business logic: don't show 5 posts from the same author in a row, mix in ads at fixed cadence, demote near-duplicates, enforce time-since-last-seen. Returns final ordered list. ~5 ms.

Twitter open-sourced parts of "the algorithm" in 2023 — their ranking pipeline matches this shape almost exactly: a SimClusters-based candidate generation, a "light ranker" using a real-time feature store, and a "heavy ranker" neural model. Facebook's analogous pipeline was described in their 2017 paper "DLRM: Deep Learning Recommendation Model."

Database Choices

Data	Store	Why
Posts (raw content)	Cassandra / Manhattan / TAO	Write-heavy, append-mostly, partitioned by author
Follow graph	FlockDB / sharded MySQL / TAO	Read-heavy, billions of edges, simple lookups
Home timelines	Redis Cluster	In-RAM ZSET; bounded per user; sub-ms reads
User profiles	MySQL / Postgres sharded	Strong consistency, low QPS per user
Media files	S3 / Haystack / Manifold	Object store; immutable; CDN-cacheable
Search	Elasticsearch / Manhattan	Inverted index, fuzzy matching
Analytics events	Kafka → HDFS / S3 → Spark	Append-only at firehose scale
ML feature store	Feast / custom (e.g. FBLearner Flow)	Online + offline parity; sub-10ms reads

Cassandra is the canonical choice for posts because the data shape fits its strengths exactly: partitioned by user_id (one shard per author), clustered by timestamp descending, writes never update existing rows (just append a new post). Reads are always "latest N posts by user X," which becomes a single-shard, single-partition seek. Twitter built Manhattan as a Cassandra-style store with operational improvements; Facebook built TAO as a graph-aware cache layer over MySQL.

Twitter's Home Timeline, Specifically

Twitter's blog and engineering talks are unusually open about their architecture. The shape, as of 2023:

Tweet creation → written to Manhattan (their custom key-value store, conceptually similar to Cassandra). User timeline (the author's own posts) is materialized.
Tweet event → Kafka → fanout service.
Fanout queries the social graph (followers of author). For each follower with < 10K followees and recently active, push the tweet ID into their Timeline Service (TLS) home timeline cache — an in-memory store derived from Redis-style ZSETs.
Heavy hitters (accounts with millions of followers) skip fanout. Their tweets are pulled at read-time.
On feed open, the Home Mixer service: reads the TLS home timeline, fetches candidate tweets from celebrity outboxes, queries other candidate sources (Out-of-Network, e.g. SimClusters-recommended tweets from accounts you don't follow), batches them through the light ranker, then the heavy ranker, applies diversification, returns ~30 tweets.
Hydration happens last: take the final 30 tweet IDs, pull tweet bodies and media URLs in parallel, pull engagement counts (likes/retweets) from a separate counters service, return to the client.

Tradeoffs & Failure Modes

Fanout backlog under bursts. A celebrity who suddenly gains followers can spike fanout queue depth. Solution: rate-limit per-author fanout, prioritize active follower delivery, accept eventual consistency for inactive users.
Hot author shards. A celebrity's user_timeline is read by millions on every feed open. Solution: replicate hot-author timelines to many cache nodes, pre-warm during traffic spikes (live events).
Inbox cache eviction storms. If a Redis node fails and its replicas can't keep up, millions of users see empty feeds while the cache rebuilds. Solution: use replicated sorted-sets, slow-rebuild from Cassandra in background.
Ranking model drift. A trained model is good for a few days; user behavior shifts. Continuous online training pipelines (Spark/Flink) are mandatory; A/B harnesses to evaluate new models before rollout are a separate problem of comparable size.
Privacy / blocking. A user blocks another. The blocked user's posts must not appear in feed. Either filter at fanout time (small inboxes contaminate) or filter at hydration time (read-time cost). Most platforms filter at hydration since blocks are rare and the post-blacklist join is cheap.
Edited / deleted posts. Posts in inboxes become stale. Either store only IDs and re-fetch at read time (always fresh, more reads), or push edits/deletes through Kafka to invalidate (more writes, lower read amplification). Most large platforms do the former.
Mute / unfollow churn. Frequent follow/unfollow doesn't backfill or remove the inbox; future posts are correctly filtered, but historical posts in the cache are not. Acceptable to most products.
Cold-start ranking. A brand-new user has no engagement history; the heavy ranker has nothing to score from. Fall back to popularity-based defaults until ~10 interactions accumulate, then transition to personalized model.

FAQ

Why not just store everything in a relational database with secondary indexes?

Math. 500B fanout writes/day at peak ~3M writes/sec exceeds the per-shard write capacity of any relational system without sharding so aggressively that you've reinvented Cassandra. Reads on the home timeline must be O(1) per user; that means the index is the data. Redis ZSET fits the shape exactly.

What's the per-user RAM cost of the home timeline cache?

~12–15 KB if you cap at 800 entries (post_id + score = 16 bytes, plus ZSET overhead). At 250M DAU that's ~3.5 TB of cache — expensive but tractable across a Redis cluster of ~150 nodes with 24 GB each. Inactive users get evicted via LRU; their cache is rebuilt on next login from Cassandra at the cost of one slow feed open (~1–2 s).

How is "ranking" different from "sorting by recency"?

Recency is a single feature. A ranked feed scores each candidate post against many features: author affinity (do you usually engage with this person), content embedding similarity to your history, predicted dwell time, predicted action probabilities (like / comment / share). The model is trained on logged user behavior to predict downstream engagement. Recency is one input of the model, not the output.

Why not let the client merge multiple followee outboxes? Push the work to mobile.

Network round trips. A mobile client doing N parallel fetches over LTE pays per-request latency on each one. Server-side merging happens with sub-ms inter-node RPCs in the same datacenter. Also, ranking needs cross-candidate features (e.g., "this is the 5th post from this author this session") that the client can't compute without seeing all candidates.

What happens when a celebrity's tweet goes viral — how does the system handle the read amplification?

The tweet's user_timeline shard is read by millions. Solutions: (1) replicate hot tweets to many cache nodes (consistent hashing with virtual nodes lets you spread one author across many shards for hot data); (2) push the tweet through a separate "trending" cache fronted by the CDN with aggressive TTLs; (3) at the application layer, dedupe identical viral tweets across feeds and serve from a single edge cache.

How fresh is the feed in practice? Is it real-time?

Eventual consistency, typically <5 seconds end-to-end (post created → visible in followers' inboxes). For 99.9% of follower pairs that's fine. The exception is "I just posted — why don't I see my own post yet?" — solved by writing the post to your own home timeline synchronously on creation, before the async fanout job runs.

Could you build a chronological-only feed and skip ranking entirely?

You can — that's how Twitter's "Latest" tab and Mastodon work. Engineering is far simpler (no model serving, no online features). Engagement is dramatically lower. Modern social products all use ranked feeds because ranked feeds keep users in the app longer; chronological feeds are a setting, not the default.

How do you A/B test ranking changes without breaking everyone's feed?

Run multiple ranker variants in parallel, hash users to variants, assign each user to one ranker deterministically. The Home Mixer routes user_id -> variant_id, calls that variant's model, logs interactions tagged with variant. Online metrics (CTR, dwell time, retention) are compared across variants. Twitter and Meta both run hundreds of A/B tests on the ranker simultaneously.