RocksDB & the LSM Family
The Log-Structured Merge tree as a practical embedded storage engine traces to Google's LevelDB (open-sourced in 2011 by Jeffrey Dean and Sanjay Ghemawat, distilled from Bigtable's LSM kernel). Facebook forked it in 2012 to add multi-threaded compaction, column families, and tunability; that fork became RocksDB. Cockroach Labs reimplemented the architecture in pure Go starting in 2018 as Pebble. Dgraph took a different design path, key/value separation, with BadgerDB. This page traces the family tree, the architectural differences, and where each engine is the right answer.
LevelDB: The Original
~20k lines of C++. Single-thread compaction. Rock-solid simplicity.
Google open-sourced LevelDB in 2011, distilled from Bigtable's LSM kernel. It was designed for single-process embedded use: Chrome's IndexedDB, Bitcoin Core's chainstate, and other small databases. The architecture: one memtable, one immutable memtable, sorted runs across seven levels (L0-L6), and single-threaded background compaction.
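The write and read path described above can be sketched as a toy model. This is an illustration under simplifying assumptions, not LevelDB's code: the class name `ToyLSM`, the `memtable_limit` knob, and the in-memory "runs" are all hypothetical, levels are collapsed into a single newest-first list of sorted runs, and the flush happens inline rather than on a background thread.

```python
from bisect import bisect_left

TOMBSTONE = object()  # marker written by delete(); hypothetical, for this sketch only


class ToyLSM:
    """Toy model of an LSM write/read path (not LevelDB's implementation)."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # entries before a flush (assumed knob)
        self.memtable = {}     # active, mutable memtable
        self.immutable = None  # frozen memtable waiting to be flushed
        self.runs = []         # flushed sorted runs, newest first

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._rotate_and_flush()

    def delete(self, key):
        # A delete is just a write of a tombstone.
        self.put(key, TOMBSTONE)

    def _rotate_and_flush(self):
        # Freeze the active memtable, start a fresh one, flush the frozen one.
        self.immutable = self.memtable
        self.memtable = {}
        run = sorted(self.immutable.items(), key=lambda kv: kv[0])
        self.runs.insert(0, run)
        self.immutable = None

    def get(self, key):
        # Newest data wins: memtable, immutable memtable, then runs newest-first.
        for table in (self.memtable, self.immutable or {}):
            if key in table:
                value = table[key]
                return None if value is TOMBSTONE else value
        for run in self.runs:
            keys = [k for k, _ in run]
            i = bisect_left(keys, key)
            if i < len(run) and run[i][0] == key:
                value = run[i][1]
                return None if value is TOMBSTONE else value
        return None
```

Reads stop at the first hit, which is why a tombstone in a newer run hides an older value until compaction removes both.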
Strengths: minimal API surface (Open, Put, Get, Delete, Iterator), small codebase that one engineer can read in a week, rock-solid stability after a decade. Weaknesses: no column families, no transactions, no parallelism; the single compaction thread becomes the throughput ceiling. No bloom filter tuning. No snapshots beyond the basic GetSnapshot.
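To make the compaction step concrete, here is a minimal sketch of what a single compaction pass does logically: merge several sorted runs into one, keep only the newest version of each key, and drop tombstones. The function name `compact` and the use of `None` as the deletion marker are assumptions for illustration. Dropping tombstones is only safe when the merge covers every older run that could hold the key, which this sketch assumes.

```python
import heapq

TOMBSTONE = None  # hypothetical deletion marker for this sketch


def compact(runs):
    """Merge sorted runs (newest first) into one run with one version per key.

    Each run is a list of (key, value) pairs sorted by key. Assumes the
    merge covers all runs, so tombstones can be dropped from the output.
    """
    # Tag entries with the run's age so equal keys sort newest-first.
    tagged = (
        [(key, age, value) for key, value in run]
        for age, run in enumerate(runs)
    )
    merged = []
    last_key = object()  # sentinel that equals no real key
    for key, _age, value in heapq.merge(*tagged, key=lambda e: (e[0], e[1])):
        if key == last_key:
            continue  # an older version shadowed by a newer one
        last_key = key
        if value is not TOMBSTONE:
            merged.append((key, value))
    return merged
```

Because this loop is sequential, running it on one thread bounds how fast flushed data can be absorbed; the multi-threaded engines discussed later run several such merges over disjoint inputs at once.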
LevelDB is in maintenance mode at Google. New projects rarely choose it; existing dependents (Chrome, Bitcoin Core) keep it because porting away carries more risk than the ongoing maintenance cost.
RocksDB: The Server-Side Fork
Facebook's evolution toward production-grade embedded storage.
Forked from LevelDB in 2012 to support Facebook's MyRocks (MySQL with RocksDB storage engine) and ZippyDB (a distributed KV store). Major additions over LevelDB:
- multi-threaded compaction (max_background_jobs)
- column families with shared WAL
- transactions (TransactionDB, OptimisticTransactionDB, WritePrepared)
- tunable bloom filters (full, partitioned, ribbon, prefix)
- universal and FIFO compaction styles
- backup engine, checkpoint API
- statistics, perf context, IO tracing
- range deletes (DeleteRange)
- compression per level
- block cache with priorities
- write batch with index
- merge operators (counters, set, append)
- WriteBufferManager for cross-CF memtable budget
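Of the additions listed above, merge operators are the least self-explanatory, so here is a sketch of the idea for the counter case. It models the semantics only: a merge records an operand without reading the old value, and the operands are folded onto the base value at read time (in a real engine, also during compaction). The class `MergeStore` and its in-memory layout are hypothetical and do not mirror RocksDB's MergeOperator interface.

```python
class MergeStore:
    """Toy store with an integer-add merge operator (semantics sketch only)."""

    def __init__(self):
        # key -> list of ("put" | "merge", int) records, oldest first
        self.records = {}

    def put(self, key, value):
        # A Put supersedes everything written before it.
        self.records[key] = [("put", value)]

    def merge(self, key, delta):
        # No read-modify-write: just append the operand.
        self.records.setdefault(key, []).append(("merge", delta))

    def get(self, key):
        ops = self.records.get(key)
        if not ops:
            return None
        total = 0  # a merge with no base value starts from zero
        for kind, operand in ops:
            total = operand if kind == "put" else total + operand
        return total
```

The benefit is on the write path: incrementing a counter costs one blind append instead of a Get followed by a Put.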
The cost: ~250k LOC, hundreds of options, and a steep learning curve. Production deployments (TiKV, MyRocks, Yugabyte, ArangoDB) all do significant tuning work. Compared to LevelDB, you get a far more capable engine but also more rope to hang yourself with.
Pebble: Go-Native LSM
Cockroach Labs' from-scratch Go implementation, since 2018.
Pebble was started because cgo call overhead made RocksDB-via-Go untenable for hot-path use cases. The Go reimplementation runs to ~80k lines of code. Its SST format intentionally stays readable by RocksDB (with care; compatibility is not guaranteed), so that backups can be ported.
Pebble's distinguishing features over RocksDB:
- Cgo-free; native Go for the entire stack
- DELRANGE optimized for Cockroach's MVCC-version-deletes
- value blocks (BlobDB-style key/value separation, optional)
- improved bulk ingestion path
- better-tuned default options for typical Cockroach workloads
- tighter integration with Go's runtime: profiling, scheduling, GC
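The range-delete item in the list above is easier to reason about with a small model. The sketch below shows the general range-tombstone idea: one record covers a whole key span and hides every point write with a lower sequence number. It is a semantic illustration, not Pebble's implementation; `RangeDelStore` and its fields are hypothetical names.

```python
class RangeDelStore:
    """Toy model of range tombstones ordered by sequence number."""

    def __init__(self):
        self.seq = 0
        self.points = {}      # key -> (seq, value)
        self.range_dels = []  # (seq, start, end) with end exclusive

    def _next_seq(self):
        self.seq += 1
        return self.seq

    def put(self, key, value):
        self.points[key] = (self._next_seq(), value)

    def delete_range(self, start, end):
        # One record for the whole span; no per-key tombstones are written.
        self.range_dels.append((self._next_seq(), start, end))

    def get(self, key):
        if key not in self.points:
            return None
        seq, value = self.points[key]
        for del_seq, start, end in self.range_dels:
            # A range tombstone hides only writes that are older than it.
            if start <= key < end and del_seq > seq:
                return None
        return value
```

Deleting a million keys this way costs one write, and a key written after the tombstone is visible again because its sequence number is higher.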
What Pebble doesn't have: the full breadth of RocksDB's tuning surface (some compaction strategies and advanced options) and features that serve older or non-Cockroach use cases. Pebble targets CockroachDB's workload first; general-purpose use is a side benefit.
BadgerDB: The WiscKey Variant
Key-value separation: keys in LSM, values in a separate log.
BadgerDB (Dgraph, 2017) implements the WiscKey design: only keys go through the LSM tree; values are appended to a separate value log file. Each LSM entry stores a value pointer (value-log file, offset, and length) instead of the value itself.
Result: compaction rewrites only keys (small), so write amplification (WA) drops dramatically for workloads with large values. The cost: every Get does an extra file read to fetch the value, and the value log requires its own garbage collection (when keys are deleted or overwritten, the corresponding value-log entries become orphans).
RocksDB: keys+values rewritten in compaction → high WA on big values
BadgerDB: only keys rewritten → low WA, extra read per Get
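A minimal sketch of key/value separation, assuming an in-memory byte array as the value log and a plain dict standing in for the keys-only LSM. The names `ToyWiscKey` and `garbage_bytes` are hypothetical; this is not BadgerDB's API or on-disk layout.

```python
class ToyWiscKey:
    """Toy key/value separation: pointers in the index, bytes in a value log."""

    def __init__(self):
        self.vlog = bytearray()  # append-only value log (in memory here)
        self.index = {}          # stands in for the LSM: key -> (offset, length)

    def put(self, key, value: bytes):
        offset = len(self.vlog)
        self.vlog += value                      # the value is written once
        self.index[key] = (offset, len(value))  # only this pointer would be compacted

    def delete(self, key):
        # The pointer goes away; the value bytes stay behind as garbage.
        self.index.pop(key, None)

    def get(self, key):
        pointer = self.index.get(key)
        if pointer is None:
            return None
        offset, length = pointer
        # This dereference is the extra read paid on every Get.
        return bytes(self.vlog[offset:offset + length])

    def garbage_bytes(self):
        # Bytes in the log that no live pointer references.
        live = sum(length for _, length in self.index.values())
        return len(self.vlog) - live
```

Overwrites and deletes both grow `garbage_bytes()`, which is the quantity a value-log garbage collector has to reclaim by copying live values forward.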
Where it wins: NoSQL-style workloads with values in the 1-100 KB range — big-value compaction in RocksDB amplifies WA dramatically; BadgerDB's keys-only LSM stays compact. Where it loses: small KV workloads where the value is in the same cache line as the key — RocksDB has no extra read, BadgerDB's pointer dereference becomes pure overhead.
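The trade-off can be put into rough numbers with a simplified model: keys pass through the LSM and are rewritten by compaction, while each value is written to the log exactly once. The function below is a back-of-the-envelope sketch, not a measurement; it ignores pointer size and value-log garbage collection, and the sizes and the LSM write amplification of 10 used in the usage note are illustrative assumptions.

```python
def effective_write_amp(key_bytes, value_bytes, lsm_write_amp):
    """Bytes written to storage per user byte when only keys go through the LSM.

    Simplified model: ignores value-pointer size and value-log GC rewrites.
    """
    lsm_bytes = lsm_write_amp * key_bytes  # keys are rewritten by compaction
    vlog_bytes = value_bytes               # each value is appended once
    return (lsm_bytes + vlog_bytes) / (key_bytes + value_bytes)
```

With 16-byte keys, 1 KB values, and an assumed LSM write amplification of 10, `effective_write_amp(16, 1024, 10)` is about 1.14. With 16-byte values the same call gives 5.5, so the saving shrinks quickly as values approach the size of keys, while the extra read per Get stays.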
Used by Dgraph, IPFS, and various Go-native projects where the workload fits.
Comparison Matrix
| Feature | LevelDB | RocksDB | Pebble | BadgerDB |
|---|---|---|---|---|
| Language | C++ | C++ | Go | Go |
| Multi-threaded compaction | no | yes | yes | yes |
| Column families | no | yes | no | no |
| Transactions | no | yes | no (atomic batches only) | yes |
| Bloom filter tuning | fixed | extensive | good | good |
| K-V separation | no | BlobDB | value blocks | core design |
| Range deletes | no | yes | yes (excellent) | limited |
| Snapshots | basic | full | full | full |
| Code size | ~20k LOC | ~250k LOC | ~80k LOC | ~30k LOC |
| Major users | Chrome, Bitcoin | TiKV, MyRocks | Cockroach | Dgraph, IPFS |
FAQ
Why did Facebook fork LevelDB into RocksDB?
LevelDB was designed for single-process embedded use (Chrome's IndexedDB is the canonical example), with single-threaded compaction, no column families, and basic tuning options. Facebook needed a server-side embedded engine with multi-threaded compaction, configurable bloom filters, column families, transactions, and tunable amplification. They forked in 2012, and the two codebases have continued to diverge ever since. RocksDB now shares ~30% of the original LevelDB code; the rest has been rewritten.
Is BadgerDB just RocksDB in Go?
No. BadgerDB has a fundamentally different design: it separates keys from values, following the WiscKey paper. Keys live in an LSM tree; values live in a separate value log. This dramatically reduces write amplification (only keys get rewritten during compaction) at the cost of a value-pointer dereference on each Get. It is excellent for workloads with large values and less suitable for small KV pairs, where the pointer dereference is pure overhead. Used by Dgraph and other Go projects where the workload fits.
What's the relationship between RocksDB and Pebble?
Pebble is a from-scratch Go reimplementation by Cockroach Labs, started in 2018. It mimics RocksDB's LSM structure and SST format (so backups can be moved between them with care) but adds Cockroach-specific extensions like better range deletes, value blocks (BlobDB-like), and tighter MVCC integration. Pebble doesn't try to be 100% RocksDB-compatible — it diverges where Cockroach's needs differ. Now used by CockroachDB and a growing list of Go projects.
Should I use LevelDB today?
Almost never for new projects. LevelDB is in maintenance mode at Google, with minimal new development. It's still embedded in Chrome (IndexedDB), Bitcoin Core (chainstate), and a few other long-lived consumers. For new code, RocksDB or one of its relatives is the better choice: more features, more active development, broader tuning. Use LevelDB only if you specifically need its small footprint (~20k LOC vs RocksDB's ~250k) and minimal feature set.
Are SST files compatible across RocksDB, LevelDB, Pebble?
Largely no. RocksDB extended LevelDB's SST format with column-family info, larger metadata, and version markers; LevelDB cannot read RocksDB SSTs. Pebble's SSTs are slightly diverged from RocksDB's (different magic, different version). External SST ingest (db->IngestExternalFile) only works for properly-versioned files. Backups should always use the engine's own backup tool (BackupEngine, etc.), not raw SST copies between engines.
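One quick way to see the format divergence is to inspect the magic number stored in the last 8 bytes of a table file. The sketch below classifies a file's tail bytes. The two constants are the table magic numbers as they appear in the LevelDB and RocksDB source trees (RocksDB also accepts the LevelDB value as its legacy block-based magic); treat them as assumptions to verify against the versions you run. The function name `classify_table_magic` is hypothetical, and Pebble's own magic is deliberately left out, so such files report as "unknown".

```python
import struct

# Table magic numbers from the upstream sources (verify for your version).
LEVELDB_MAGIC = 0xDB4775248B80FB57              # also RocksDB's legacy block-based magic
ROCKSDB_BLOCK_BASED_MAGIC = 0x88E241B785F4CFF7  # RocksDB block-based table format


def classify_table_magic(data: bytes) -> str:
    """Classify a table file by the little-endian magic in its last 8 bytes."""
    if len(data) < 8:
        return "too-short"
    (magic,) = struct.unpack("<Q", data[-8:])
    if magic == LEVELDB_MAGIC:
        return "leveldb-or-legacy-rocksdb"
    if magic == ROCKSDB_BLOCK_BASED_MAGIC:
        return "rocksdb-block-based"
    return "unknown"
```

In practice you would pass the final bytes of an `.sst` or `.ldb` file. A matching magic only identifies the container format; it does not mean another engine can interpret the blocks, properties, or format version inside.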
Why aren't there more LSM stores?
Because writing one is hard. The LSM kernel itself is ~5k LOC; the surrounding correctness machinery (recovery, snapshot semantics, atomic writes, compaction priorities, statistics, metrics) is 100k+ LOC of subtle code. Newer stores like Pebble had to invest years to reach a production-ready state. Most projects just wrap RocksDB. The exceptions (BadgerDB, Pebble) have specific reasons: language affinity, a fundamentally different design (WiscKey), or tighter integration with a higher-level system.