Collaborative Editing
When two users type into the same document at the same time, the system must converge on a single document state without losing edits and without confusing either user's intent. The two competing solutions — Operational Transformation (OT) and Conflict-free Replicated Data Types (CRDTs) — have shaped Google Docs (custom OT), Notion (block CRDT), and the modern open-source ecosystem (Yjs, Automerge). The right choice depends on whether you control the network topology and how rich the document schema is.
Architecture
Capacity Estimation
| Metric | Value | Notes |
|---|---|---|
| Active docs (peak) | ~10 M | open in tabs |
| Concurrent editors per doc | 1–100 | typical 2–5 |
| Edits/s per active doc | 1–5 | typing burst |
| WebSocket conns | ~30 M | doc viewers |
| Op log retention | 30 d hot, ∞ cold | full history needed |
| Snapshot frequency | every 1000 ops | recovery time |
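The table's figures can be combined into a rough load estimate. The sketch below is a back-of-envelope calculation only; the 50-byte average op size (`ASSUMED_OP_BYTES`) is a hypothetical value chosen for illustration and does not come from the table.

```python
# Back-of-envelope load derived from the capacity table.
# ASSUMED_OP_BYTES is a hypothetical figure, not taken from the table.
ACTIVE_DOCS = 10_000_000
WEBSOCKET_CONNS = 30_000_000
EDITS_PER_SEC = (1, 5)           # per active doc, low and high
SNAPSHOT_EVERY_OPS = 1000
ASSUMED_OP_BYTES = 50

conns_per_doc = WEBSOCKET_CONNS / ACTIVE_DOCS

# Worst case: every active doc is in a typing burst at the same moment.
peak_ops_per_sec = tuple(ACTIVE_DOCS * rate for rate in EDITS_PER_SEC)

# Time between snapshots for a doc that is edited continuously.
seconds_between_snapshots = tuple(SNAPSHOT_EVERY_OPS / rate for rate in EDITS_PER_SEC)

# Op log write rate if every doc bursts at the high rate.
log_bytes_per_sec = peak_ops_per_sec[1] * ASSUMED_OP_BYTES

print(f"connections per doc: {conns_per_doc:.0f}")
print(f"peak ops/s: {peak_ops_per_sec[0]:,} to {peak_ops_per_sec[1]:,}")
print(f"snapshot interval: {seconds_between_snapshots[1]:.0f} s to {seconds_between_snapshots[0]:.0f} s")
print(f"op log write rate at peak: {log_bytes_per_sec / 1e9:.1f} GB/s")
```

The peak figure is an upper bound, since it assumes all active docs burst simultaneously; real aggregate rates would sit well below it.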
The Convergence Problem
Two users start with "hello". Alice types "X" at position 0 and Bob types "Y" at position 5. The ops are concurrent, and both clients race to send them. Locally, Alice sees "Xhello" and Bob sees "helloY". If each side then naively replays the other's op at its original index, Alice ends with "XhellYo" and Bob with "XhelloY": the replicas have diverged. The convergence requirement: every replica that has applied the same set of ops, in any order, ends up at the same document state.
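The divergence in this scenario can be reproduced in a few lines. This is a minimal sketch of naive replay, where each replica applies the remote op at its original index; the helper name `insert` is illustrative.

```python
def insert(doc: str, pos: int, text: str) -> str:
    """Apply a positional insert with no awareness of concurrent edits."""
    return doc[:pos] + text + doc[pos:]

base = "hello"
alice_op = (0, "X")   # Alice types "X" at position 0
bob_op = (5, "Y")     # Bob types "Y" at position 5

# Each user applies their own op locally first.
alice_local = insert(base, *alice_op)
bob_local = insert(base, *bob_op)

# Naive sync: replay the other user's op at its original index.
alice_after = insert(alice_local, *bob_op)
bob_after = insert(bob_local, *alice_op)

print(alice_local, bob_local)   # Xhello helloY
print(alice_after, bob_after)   # XhellYo XhelloY
```

Bob's index 5 meant "end of the document" when he typed, but on Alice's replica index 5 now falls before the final "o". Both OT and CRDTs exist to preserve that intent.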
OT and CRDT solve this with different mechanics; both are correct. The architectural differences below matter more in practice than the algorithmic ones.
Operational Transformation (OT)
Each op carries a position and content. When the server receives op B that is concurrent with an already-applied op A, it transforms B against A so that B's indices shift to account for A's effect, then forwards the transformed op to the other clients.
- Pros: small ops, intuitive for plain-text documents, decades of practice (Google Wave, Google Docs).
- Cons: practical deployments rely on a central server to define a canonical order, and the transform function is tricky. TP1 has to hold for every op pair (TP2 as well if there is no central ordering), and the literature is full of correctness bugs in published implementations.
Google Docs uses a custom server-mediated OT with rich operations (insert/delete + style spans). It only works because Google has central servers and tight control.
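To make the transform step concrete, here is a minimal sketch for insert-only ops. It is not Google's algorithm: the `Insert` type, the `site` tie-breaker, and the function names are assumptions for illustration, and deletes and style spans are left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Insert:
    pos: int
    text: str
    site: str  # tie-breaker when two inserts target the same position

def apply(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]

def transform(op: Insert, against: Insert) -> Insert:
    """Rewrite `op` so it can be applied after `against` has been applied."""
    lands_before = against.pos < op.pos or (
        against.pos == op.pos and against.site < op.site
    )
    if lands_before:
        return Insert(op.pos + len(against.text), op.text, op.site)
    return op

def converges(doc: str, a: Insert, b: Insert) -> bool:
    """TP1: applying a then b', or b then a', yields the same document."""
    via_a = apply(apply(doc, a), transform(b, a))
    via_b = apply(apply(doc, b), transform(a, b))
    return via_a == via_b

a = Insert(0, "X", "alice")
b = Insert(5, "Y", "bob")
merged = apply(apply("hello", a), transform(b, a))
print(merged)                       # XhelloY
print(converges("hello", a, b))     # True
```

The tie-break on `site` matters: without it, two inserts at the same position would each shift (or each stay), and the replicas would disagree on their order.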
CRDT (Conflict-free Replicated Data Types)
CRDTs sidestep transformation by giving every character a globally unique ID. Insertions become "insert this character with ID X between IDs Y and Z." Concurrent inserts at the same point are resolved by ID order. The local state is the linked list (or tree) of these IDs; rendering walks them in order.
- Pros: no central server required (P2P safe), simpler correctness story, network can deliver ops out of order.
- Cons: per-character metadata is heavy; naive CRDTs can balloon storage to 10–100× the document size. Practical algorithms and libraries (RGA, Logoot, Yjs's YATA, Automerge) compress and prune metadata to keep this manageable.
Yjs (by Kevin Jahns) is among the fastest open-source CRDT implementations and is widely used in "instant collab" tooling. Automerge emphasizes JSON-document semantics and offline-first sync.
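The idea of unique IDs and ordering concurrent inserts by ID can be sketched as a small RGA-style sequence. This is an illustrative toy, not the Yjs or Automerge data model: it supports inserts only, and it assumes an op is delivered after the op that created its `origin` character (causal delivery). All class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Id = Tuple[int, str]  # (Lamport counter, site name), compared as a tuple

@dataclass(frozen=True)
class InsertOp:
    id: Id
    origin: Optional[Id]  # ID of the character to the left; None = document start
    char: str

class Replica:
    def __init__(self, site: str) -> None:
        self.site = site
        self.clock = 0
        self.items: List[InsertOp] = []

    def local_insert(self, index: int, char: str) -> InsertOp:
        """Insert at a visible index and return the op to send to peers."""
        self.clock += 1
        origin = self.items[index - 1].id if index > 0 else None
        op = InsertOp((self.clock, self.site), origin, char)
        self.apply(op)
        return op

    def apply(self, op: InsertOp) -> None:
        if any(item.id == op.id for item in self.items):
            return  # duplicate delivery is a no-op
        self.clock = max(self.clock, op.id[0])
        pos = 0
        if op.origin is not None:
            pos = next(i for i, it in enumerate(self.items) if it.id == op.origin) + 1
        # Skip concurrent siblings (and their descendants) that have larger IDs.
        while pos < len(self.items) and self.items[pos].id > op.id:
            pos += 1
        self.items.insert(pos, op)

    def text(self) -> str:
        return "".join(item.char for item in self.items)

alice, bob = Replica("alice"), Replica("bob")
for i, ch in enumerate("hi"):
    bob.apply(alice.local_insert(i, ch))

# Concurrent inserts at the same spot, delivered in opposite orders.
op_a = alice.local_insert(1, "A")
op_b = bob.local_insert(1, "B")
alice.apply(op_b)
bob.apply(op_a)
bob.apply(op_a)  # redelivery changes nothing

print(alice.text(), bob.text())
```

Both replicas settle on the same order because the tie is decided by comparing IDs, not by arrival order.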
Google Docs (custom OT)
Google Docs predates modern CRDTs. The architecture is server-mediated OT with:
- Each client maintains a local op buffer and sends ops to the server tagged with the revision number they were based on.
- The server transforms each incoming op against any ops committed after that revision, applies it, and broadcasts the transformed op to the other clients.
- Snapshots are taken periodically; ops older than the snapshot are pruned.
- Style spans (bold, link) are operations on attribute ranges; the OT extension that handles them is internal to Google and not publicly documented.
The architecture is tightly coupled to having a central authority. Offline edits are accepted but are awkward: when you reconnect, your buffered ops are transformed against everything you missed; long offline periods produce unintuitive merge results.
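The revision-number flow described above can be sketched as follows. This is a simplified, insert-only illustration under assumed names (`DocServer`, `receive`, `site`); it is not Google's implementation and omits the client side of the protocol.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Insert:
    pos: int
    text: str
    site: str  # tie-breaker for inserts at the same position

def transform(op: Insert, against: Insert) -> Insert:
    """Shift `op` right if `against` landed before it."""
    if against.pos < op.pos or (against.pos == op.pos and against.site < op.site):
        return Insert(op.pos + len(against.text), op.text, op.site)
    return op

class DocServer:
    def __init__(self, text: str = "") -> None:
        self.text = text
        # history[i] is the op that moved the doc from revision i to i + 1.
        self.history: List[Insert] = []

    @property
    def revision(self) -> int:
        return len(self.history)

    def receive(self, op: Insert, base_revision: int) -> Tuple[Insert, int]:
        """Transform against everything the client had not seen, then commit."""
        for committed in self.history[base_revision:]:
            op = transform(op, committed)
        self.text = self.text[:op.pos] + op.text + self.text[op.pos:]
        self.history.append(op)
        return op, self.revision  # what gets broadcast to the other clients

server = DocServer("hello")
# Both clients composed their op against revision 0.
alice_committed, _ = server.receive(Insert(0, "X", "alice"), base_revision=0)
bob_committed, rev = server.receive(Insert(5, "Y", "bob"), base_revision=0)

print(server.text, rev, bob_committed.pos)
```

A client that reconnects after a long offline period sends a very old `base_revision`, so its ops are transformed against a long slice of history, which is where the unintuitive merge results come from.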
Notion (block-level CRDT)
Notion documents are trees of blocks (paragraph, heading, list item, embed). Each block has a stable ID and child pointers. Edits are operations on individual blocks (insert child, delete, move) plus per-block content edits. Block-level CRDT semantics make most moves conflict-free: two users dragging different blocks simply produce both moves. Concurrent moves of the same block, or moves that would create a cycle, still need a deterministic tie-break.
Within a block, content is plain text edited via a sub-CRDT (Yjs-style). The two-tier structure mirrors how users think about pages: the document is a list of blocks, each block is text. Conflict surface is dramatically reduced because most edits never touch the same block.
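One way to picture block-level moves is a last-writer-wins placement per block, with fractional sort keys for sibling order. The sketch below is an assumption-laden illustration, not Notion's actual data model; `Move`, `BlockTree`, and the stamp format are hypothetical, and cycle detection is omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class Move:
    block_id: str
    new_parent: str
    sort_key: float          # position among siblings (fractional index)
    stamp: Tuple[int, str]   # (Lamport counter, user) for last-writer-wins

class BlockTree:
    def __init__(self) -> None:
        self.placement: Dict[str, Move] = {}  # block_id -> winning move

    def apply(self, move: Move) -> None:
        current = self.placement.get(move.block_id)
        if current is None or move.stamp > current.stamp:
            self.placement[move.block_id] = move

    def children(self, parent: str) -> List[str]:
        kids = [m for m in self.placement.values() if m.new_parent == parent]
        return [m.block_id for m in sorted(kids, key=lambda m: (m.sort_key, m.stamp))]

def build(moves: List[Move]) -> BlockTree:
    tree = BlockTree()
    for move in moves:
        tree.apply(move)
    return tree

initial = [
    Move("a", "page", 1.0, (1, "alice")),
    Move("b", "page", 2.0, (2, "alice")),
    Move("c", "page", 3.0, (3, "alice")),
]
alice_move = Move("c", "page", 0.5, (4, "alice"))  # drag c to the top
bob_move = Move("a", "page", 2.5, (4, "bob"))      # drag a below b

replica_1 = build(initial + [alice_move, bob_move])
replica_2 = build(initial + [bob_move, alice_move])
print(replica_1.children("page"), replica_2.children("page"))

# Two users move the same block: the larger stamp wins on every replica.
conflict_1 = build(initial + [Move("b", "page", 0.1, (5, "alice")), Move("b", "page", 9.0, (5, "bob"))])
conflict_2 = build(initial + [Move("b", "page", 9.0, (5, "bob")), Move("b", "page", 0.1, (5, "alice"))])
```

Moves of different blocks commute because each touches its own placement entry; only same-block moves need the stamp comparison.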
Real-time Presence Cursors
Showing other people's cursors in the doc is a presence channel orthogonal to the op stream. Implementation:
- Each client publishes its cursor position (anchor, head) as a separate WebSocket message whenever the cursor moves or a key is pressed, rate-capped to roughly one update per ~100 ms.
- Server fans out to other clients viewing the doc.
- Cursor positions are expressed relative to CRDT item IDs, not byte offsets; otherwise every remote insert ahead of the cursor leaves it rendered at the wrong character.
- Color and name come from a stable per-user palette; ephemeral, not persisted.
Avoid persisting presence; treat it as truly transient. A user who closes their tab should disappear within seconds, not haunt the doc until garbage collection.
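The two ideas above, ID-anchored cursors and transient presence, can be sketched together. This is a minimal in-memory illustration; `PresenceRoom`, the 5-second TTL, and the string IDs are assumptions, and time is passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Presence:
    user: str
    anchor_id: Optional[str]  # ID of the character the cursor sits after; None = doc start
    head_id: Optional[str]
    last_seen: float

class PresenceRoom:
    """In-memory only: nothing here is written to the op log or a database."""

    def __init__(self, ttl_seconds: float = 5.0) -> None:
        self.ttl = ttl_seconds
        self.entries: Dict[str, Presence] = {}

    def update(self, user: str, anchor_id: Optional[str], head_id: Optional[str], now: float) -> None:
        self.entries[user] = Presence(user, anchor_id, head_id, now)

    def active(self, now: float) -> List[str]:
        """Drop entries that have not been refreshed within the TTL."""
        self.entries = {
            u: p for u, p in self.entries.items() if now - p.last_seen <= self.ttl
        }
        return sorted(self.entries)

def resolve_offset(char_ids: List[str], anchor_id: Optional[str]) -> int:
    """Map an ID-anchored cursor to an offset in the current character order."""
    return 0 if anchor_id is None else char_ids.index(anchor_id) + 1

room = PresenceRoom(ttl_seconds=5.0)
room.update("alice", anchor_id="a2", head_id="a2", now=0.0)

ids_before = ["a1", "a2", "a3"]
ids_after = ["b1", "a1", "a2", "a3"]  # a remote insert landed at the front
offset_before = resolve_offset(ids_before, "a2")
offset_after = resolve_offset(ids_after, "a2")

print(offset_before, offset_after)               # 2 3
print(room.active(now=3.0), room.active(now=10.0))
```

The cursor stays attached to the same character even though its numeric offset changed, and Alice drops out of the room once her updates stop arriving.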
Peer-to-Peer vs Server-Mediated
CRDTs make P2P possible: WebRTC data channels between editors, with a server only for discovery and offline relay. Pros: lower latency on a shared LAN, no server hosting cost.
In practice, several drawbacks push most teams toward a server-mediated design:
- Persistence — you still need a server to store the doc when no peer is online.
- Auth/ACL — access control needs a central authority.
- Network reliability — not all peers can reach each other (NAT, corporate firewalls).
- Audit and compliance — enterprise demands a server-side log.
Most production systems are server-mediated CRDT: clients send ops to the server, server logs and broadcasts. P2P is reserved for offline-first or specialized use.
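A server-mediated relay covers the persistence, ACL, and audit concerns in one place. The sketch below is a hypothetical in-memory version; `DocRelay`, `submit`, and the callback-based subscribers are illustrative names, and a real deployment would sit behind WebSockets and durable storage.

```python
from typing import Callable, Dict, List, Set

class DocRelay:
    """Checks write access, logs every update, and fans it out to other clients."""

    def __init__(self, allowed_writers: Set[str]) -> None:
        self.allowed_writers = allowed_writers
        self.log: List[bytes] = []  # server-side record, available when no peer is online
        self.subscribers: Dict[str, Callable[[bytes], None]] = {}

    def join(self, user: str, on_update: Callable[[bytes], None]) -> None:
        self.subscribers[user] = on_update

    def submit(self, user: str, update: bytes) -> bool:
        if user not in self.allowed_writers:
            return False  # ACL is enforced centrally
        self.log.append(update)
        for other, deliver in self.subscribers.items():
            if other != user:
                deliver(update)
        return True

inbox: Dict[str, List[bytes]] = {"alice": [], "bob": [], "eve": []}
relay = DocRelay(allowed_writers={"alice", "bob"})
for user in inbox:
    relay.join(user, inbox[user].append)

accepted = relay.submit("alice", b"op-1")
rejected = relay.submit("eve", b"op-2")  # eve can view but not write

print(accepted, rejected, relay.log)
```

Because the ops are CRDT updates, the server does not need to understand or transform them; it only orders, stores, and forwards.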
Storage: Op Log + Snapshots
Two append-only streams:
- Op log — every op, totally ordered per doc. Stored in an append-only table (Postgres) or object store.
- Snapshot — the materialized doc state as of op N. Saves recovery time: rather than replay 100K ops, load snapshot + replay the tail.
Garbage collection: snapshots compact tombstoned characters, reducing CRDT bloat. Old ops can be archived to cold storage; only the recent window is hot.
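Recovery from a snapshot plus the log tail can be sketched as below. This is an in-memory illustration with hypothetical names (`DocStore`, `Snapshot`); ops are simplified to already-ordered positional inserts, and tombstone compaction is not modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple

Op = Tuple[int, str]  # (position, text), already in the doc's total order

def apply_op(text: str, op: Op) -> str:
    pos, inserted = op
    return text[:pos] + inserted + text[pos:]

@dataclass
class Snapshot:
    upto: int   # number of log entries folded into `text`
    text: str

class DocStore:
    def __init__(self, snapshot_every: int = 1000) -> None:
        self.snapshot_every = snapshot_every
        self.log: List[Op] = []          # append-only
        self.snapshot = Snapshot(0, "")

    def load(self) -> Tuple[str, int]:
        """Return the current text and how many ops had to be replayed."""
        text = self.snapshot.text
        tail = self.log[self.snapshot.upto:]
        for op in tail:
            text = apply_op(text, op)
        return text, len(tail)

    def append(self, op: Op) -> None:
        self.log.append(op)
        if len(self.log) - self.snapshot.upto >= self.snapshot_every:
            text, _ = self.load()
            self.snapshot = Snapshot(upto=len(self.log), text=text)

store = DocStore(snapshot_every=3)
for i, ch in enumerate("abcde"):
    store.append((i, ch))

text, replayed = store.load()
print(text, replayed, store.snapshot)
```

With the table's setting of one snapshot per 1000 ops, a load replays at most 999 ops regardless of how long the full history is.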
Failure Modes
- Op duplication: a client retries after a disconnect and the server applies the op twice. Use op IDs for dedup. CRDT libraries typically apply an already-seen update as a no-op; OT requires explicit dedup.
- Diverged state: a client misses an op, applies later ops, and renders wrong content. A periodic state-vector exchange detects the divergence; on detection, the client requests the missing delta.
- Long offline edit: a client returns with 1000 buffered ops; the merge is correct but the UX is jarring (text appears in unexpected places). Show a clear "syncing" UI; in extreme cases, fork into a branch.
- Server bottleneck: one super-active doc saturates a server. Shard by doc_id and route with a consistent hash so that all clients of one doc land on the same node.
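Dedup by op ID and divergence detection by state vector can share one structure. The following is a minimal sketch under the assumption that each client numbers its ops 1, 2, 3, and so on; `OpStore` and its methods are hypothetical names, not a library API.

```python
from typing import Dict, List, Tuple

Op = Tuple[str, int, str]  # (client, clock, payload); clocks start at 1 per client

class OpStore:
    def __init__(self) -> None:
        self.ops: Dict[Tuple[str, int], str] = {}

    def add(self, op: Op) -> bool:
        """Store the op; return False if this (client, clock) was already seen."""
        client, clock, payload = op
        if (client, clock) in self.ops:
            return False
        self.ops[(client, clock)] = payload
        return True

    def state_vector(self) -> Dict[str, int]:
        """Highest clock per client with no gaps below it."""
        vector: Dict[str, int] = {}
        for client, clock in sorted(self.ops):
            if clock == vector.get(client, 0) + 1:
                vector[client] = clock
        return vector

    def delta_for(self, remote_vector: Dict[str, int]) -> List[Op]:
        """Ops the remote side is missing, given its state vector."""
        return [
            (client, clock, payload)
            for (client, clock), payload in sorted(self.ops.items())
            if clock > remote_vector.get(client, 0)
        ]

server = OpStore()
for op in [("alice", 1, "h"), ("alice", 2, "i"), ("alice", 3, "!"), ("bob", 1, "?")]:
    server.add(op)

client = OpStore()
client.add(("alice", 1, "h"))
client.add(("bob", 1, "?"))

missing = server.delta_for(client.state_vector())
for op in missing:
    client.add(op)
retry_applied = client.add(("alice", 2, "i"))  # a retried op is ignored

print(missing, retry_applied)
```

Comparing vectors is cheap (one integer per client), so it can run on a timer; the full delta is only transferred when the vectors differ.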
FAQ
OT or CRDT — which should I choose?
For new projects, choose a CRDT (Yjs). The library ecosystem and offline-first story are better, and you do not need to invent a transform function. OT is appropriate only if you have an existing central server and a simple text-only document.
How do you handle large pastes?
Send one large insert op (a CRDT bulk insert) rather than one op per character. Modern CRDTs store consecutive inserts as a single run, so the size impact is typically under 2× the byte length.
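The reason a paste stays cheap is that a run of characters from one client can share a single ID range. The sketch below illustrates that idea only; `Run` and `append_insert` are hypothetical, inserts are assumed to land at the end of the sequence, and this is not the internal format of any particular library.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Run:
    client: str
    start_clock: int  # the first character has ID (client, start_clock)
    text: str         # character i has ID (client, start_clock + i)

    def id_at(self, i: int) -> Tuple[str, int]:
        return (self.client, self.start_clock + i)

def append_insert(runs: List[Run], client: str, clock: int, text: str) -> None:
    """Append text, extending the last run when the IDs are contiguous."""
    last = runs[-1] if runs else None
    if last and last.client == client and last.start_clock + len(last.text) == clock:
        last.text += text
    else:
        runs.append(Run(client, clock, text))

runs: List[Run] = []
append_insert(runs, "alice", 1, "x" * 10_000)   # one paste, one run
append_insert(runs, "alice", 10_001, "abc")     # typing right after the paste extends it
append_insert(runs, "bob", 1, "!")              # another client starts a new run

print(len(runs), len(runs[0].text), runs[0].id_at(10_000))
```

Ten thousand pasted characters cost one run header instead of ten thousand per-character IDs; every character still has a derivable unique ID.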
What about undo / redo?
Use a per-user undo stack. In OT, invert the op locally and transform the inverse against any ops that have arrived since. With a CRDT, undo needs explicit semantics (Yjs provides UndoManager); a naive "delete what I just added" deletes whatever character has that ID, which may have moved.
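The simplest per-user stack records the IDs of the user's own inserts and tombstones them on undo. The sketch below shows only that basic form, with hypothetical `Doc` and `UndoStack` classes; it has no redo, does not undo deletes, and is not how Yjs's UndoManager is implemented.

```python
from typing import Dict, List, Optional, Set, Tuple

Id = Tuple[str, int]  # (user, clock)

class Doc:
    def __init__(self) -> None:
        self.order: List[Id] = []      # every ID ever inserted, in document order
        self.chars: Dict[Id, str] = {}
        self.tombstones: Set[Id] = set()

    def insert_after(self, left: Optional[Id], new_id: Id, ch: str) -> None:
        pos = 0 if left is None else self.order.index(left) + 1
        self.order.insert(pos, new_id)
        self.chars[new_id] = ch

    def delete(self, target: Id) -> None:
        self.tombstones.add(target)  # mark as deleted, keep the ID in place

    def text(self) -> str:
        return "".join(self.chars[i] for i in self.order if i not in self.tombstones)

class UndoStack:
    """Tracks only this user's inserts, so undo never touches other users' text."""

    def __init__(self, doc: Doc, user: str) -> None:
        self.doc = doc
        self.user = user
        self.inserted: List[Id] = []

    def insert_after(self, left: Optional[Id], clock: int, ch: str) -> Id:
        new_id = (self.user, clock)
        self.doc.insert_after(left, new_id, ch)
        self.inserted.append(new_id)
        return new_id

    def undo(self) -> None:
        if self.inserted:
            self.doc.delete(self.inserted.pop())

doc = Doc()
alice = UndoStack(doc, "alice")
a = alice.insert_after(None, 1, "a")
b = alice.insert_after(a, 2, "b")
doc.insert_after(None, ("bob", 1), "Z")  # remote insert shifts every offset by one

before = doc.text()
alice.undo()
after = doc.text()
print(before, after)
```

An offset-based undo ("delete the character at index 1") would have removed "a" here, because Bob's insert shifted the text; the ID-based version removes "b". It still inherits the caveat above: if that character has since been moved or reworked by someone else, deleting it by ID may surprise them.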
Can I use this for a code editor?
Yes. VS Code Live Share offers a comparable experience. Add cursor decorations, syntax highlighting that survives merges, and a fast renderer. A Monaco binding for Yjs (y-monaco) is available.