Social Graph
A social graph is a billion-node, ten-billion-edge structure with extreme degree skew: most users have hundreds of edges, a few have hundreds of millions. Queries are read-dominated (every feed view, every recommendation), writes are bursty (a follow campaign), and the only operations that matter are "my edges", "their edges", and "shortest path". The architecture choices are dominated by the celebrity problem and graph partitioning.
Architecture
Capacity Estimation
| Metric | Value | Notes |
|---|---|---|
| Users | ~3 B | Facebook-scale |
| Edges | ~500 B | followers + interests + actions |
| Edge reads/s | ~50 M | cache-hit dominated |
| Edge writes/s | ~500 K | follows, likes, comments |
| Avg degree | ~150 | per-user friends |
| p99 degree | ~50 M | celebrities |
| Storage | ~50 TB | just edges, with metadata |
Storage: TAO, JanusGraph, Neo4j
TAO (Facebook's "The Associations and Objects") is the canonical example: a write-through cache layer over sharded MySQL, exposing two primitives — objects (typed nodes) and associations (typed directed edges with metadata). Reads hit a leader-follower cache hierarchy; writes go through the shard's leader, then asynchronously to followers. TAO trades strict consistency for read latency and bandwidth efficiency — the eventual-consistency window is < 1 s in practice.
JanusGraph is open-source, pluggable storage (Cassandra, HBase) and indexing (Elasticsearch). Strong on Gremlin traversal. Operationally heavier than TAO but you do not have to build it.
Neo4j is single-master, ACID, expressive Cypher. Fits ~10 B-edge graphs on one beefy box; horizontal scale via fabric is a recent and rough capability. Best for analytical graph queries on enterprise-scale data, not consumer social.
Practical lesson: most consumer-scale social graphs end up looking like TAO regardless of starting choice — an edge-list table sharded by source vertex, with a write-through cache. The "graph DB" label matters less than the access pattern.
Shortest Path and Friend-of-Friend
Six-degrees queries on a billion-node graph cannot do online BFS. The trick is bidirectional BFS: expand from both endpoints simultaneously, terminate when frontiers meet. Cuts the search space from bd to 2 · bd/2. For social graphs at average degree 150 and depth 4, that is 1504=506M vs 2×1502=45K nodes touched.
Friend-of-friend is the friendly cousin: enumerate neighbors of neighbors, intersect with a candidate set, score, return top-k. The cost is dominated by celebrity edges (see below); production systems precompute FoF for the top fraction of users via batch Spark jobs.
Graph Partitioning
You cannot fit the graph on one machine. Partition by node: each shard owns a subset of users + their outgoing edges. Then the question is which partitioning:
- Hash by user_id — uniform load, but every traversal crosses many shards.
- Community-based (METIS, Louvain) — minimizes cross-shard edges but expensive to compute and rebalance as the graph evolves.
- Geography — users by country / region. Locality is good for in-region queries; falls apart for international friend graphs.
Most production systems hash and accept the cross-shard traversal cost, because the alternative requires global recomputation as users join/leave.
The Celebrity Problem
One user with 100 M followers means one shard has 100 M edges anchored at that user. Three flavors of pain:
- Read amplification — "is X following Y?" for celebrity Y hits a giant edge list. Maintain a reverse index sharded by follower so the lookup is O(1) on the follower side.
- Fan-out on write — celebrity posts; you need to deliver to 100 M home feeds. Pure push fan-out blocks the writer for tens of minutes. Hybrid: push to most users, pull for celebrity-followed users (their feed is assembled at read time by querying celebrities they follow).
- Hot shard — the celebrity's shard is hammered by every "who follows X?" query. Replicate hot user records to multiple shards; route queries randomly.
Empirical heuristic from Twitter's feed design: anyone with > 10 K followers is treated as a celebrity for fan-out purposes.
Reciprocal vs Directional Follows
Facebook (friends) is reciprocal: a friendship is two directed edges, both must agree. Twitter (followers) is directional: A follows B without B's consent. Storage:
- Reciprocal: store as a single edge per pair (
A–B) keyed canonically (smaller-id first); a friend list is a range query on either endpoint. - Directional: store two edge lists per user — following (out-edges) and followers (in-edges). Doubles storage; halves query cost on common reads.
Mixed graphs (LinkedIn: connections + follows) keep the two relationships in separate tables; never overload one column.
Failure Modes
- Replication lag spike — user adds friend, refreshes, friend not visible. Read-your-writes via session-pinning to the leader for that user's window.
- Mass unfollow campaign — bot army unfollows a celebrity; 10 M edge deletes pile up on one shard. Throttle delete throughput per target.
- Schema migration on edges — adding a column to
edge_listacross hundreds of shards. Online schema-change tools (gh-ost, pt-osc) are mandatory. - Garbage collection — dead accounts leave dangling edges. Background sweep marks orphans; a soft-delete window protects against false positives.
FAQ
Should I use a graph database?
Only if your queries are deeply traversal-heavy (5+ hops, complex pattern matching). For "list my friends" and "do A and B share a friend?", a sharded relational store with a cache (TAO-style) is simpler and faster.
How do you do PageRank on a social graph?
Offline. Pregel/Spark GraphX once daily; serve the score as a feature in ranking. Online recomputation is infeasible.
What about end-to-end consistency?
Within a session, pin reads to the leader for < 1 minute after a write. Cross-session, accept eventual consistency — users do not notice 1 s lag on a follow.
How does fan-out interact with feed ranking?
Fan-out delivers candidate posts; ranking happens at read time over the candidate set. Separating production (fan-out) from consumption (ranking) lets you change ranking without re-fanning history.