RocksDB & the LSM Family

As a widely deployed embedded library, the Log-Structured Merge tree traces to Google's LevelDB, written by Jeffrey Dean and Sanjay Ghemawat, modeled on Bigtable's tablet storage, and open-sourced in 2011. Facebook forked it in 2012 to add multi-threaded compaction, column families, and tunability; that fork became RocksDB. Cockroach Labs reimplemented the architecture in pure Go starting in 2018 as Pebble. Dgraph took a different design path, key/value separation, with BadgerDB. This page traces the family tree, the architectural differences, and where each engine is the right answer.

LSM Family Tree

Key Numbers

- LevelDB LOC: ~20k
- RocksDB LOC: ~250k
- Pebble LOC: ~80k Go
- BadgerDB LOC: ~30k Go
- RocksDB CFs: yes
- LevelDB CFs: no
- BadgerDB unique: value log

Why Multiple LSM Engines

Language affinity

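Cgo overhead is real. A Go project doing 1M ops/sec against RocksDB pays roughly 100 ns × 1M = 100 ms of CPU time per second on the FFI boundary alone, about 10% of one core. Pebble eliminates this for Go. Rust calls into C++ with far less overhead, which is why TiKV uses RocksDB through bindings rather than reimplementing it.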
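The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 100 ns per-call cost is the figure quoted in the text, and `ffi_overhead_fraction` is a hypothetical helper name, not part of any library.

```python
def ffi_overhead_fraction(ops_per_sec: float, ns_per_call: float) -> float:
    """CPU-seconds spent crossing the FFI boundary per wall-clock second."""
    return ops_per_sec * ns_per_call * 1e-9


# Figures from the text: 1M ops/sec at ~100 ns per Cgo call.
overhead = ffi_overhead_fraction(1_000_000, 100)
print(f"{overhead * 1000:.0f} ms of FFI overhead per second "
      f"({overhead:.0%} of one core)")
```

Doubling either the call rate or the per-call cost doubles the overhead, which is why the penalty matters on hot paths and is negligible for occasional calls.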

Specialization

BadgerDB optimizes for big values via key-value separation. RocksDB optimizes for general workloads. LevelDB optimizes for simplicity. Each makes choices that suit its target.

License/governance

RocksDB is dual-licensed under GPLv2 and Apache 2.0; LevelDB is BSD-3-Clause. Neither license forces a fork. Forks and rewrites happen for governance: keeping Pebble inside Cockroach Labs lets the team ship features faster than upstream RocksDB might accept them.

LevelDB: The Original

~20k lines of C++. Single-thread compaction. Rock-solid simplicity.

Google open-sourced LevelDB in 2011, distilled from Bigtable's LSM kernel. It was designed for single-process embedded use: Chrome's IndexedDB, Bitcoin Core's chainstate, and smaller databases. The architecture: one active memtable, one immutable memtable being flushed, sorted runs across seven levels (L0 through L6), and single-threaded background compaction.
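The read and write paths described above can be sketched as a toy model. This is an illustration under simplifying assumptions, not LevelDB's implementation: `ToyLSM` and its methods are hypothetical names, there is no WAL or on-disk format, only two levels are modeled, and flushes run inline instead of on a background thread.

```python
import bisect


class ToyLSM:
    """Toy model: one memtable, one immutable memtable, L0 runs, one L1 run."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}      # active memtable, absorbs writes
        self.immutable = None   # frozen memtable waiting to be flushed
        self.l0 = []            # flushed sorted runs, newest first
        self.l1 = []            # one compacted sorted run of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Freeze the full memtable and start a fresh one.
            self.immutable, self.memtable = self.memtable, {}
            self.flush()  # a real engine does this in the background

    def flush(self):
        if self.immutable is not None:
            self.l0.insert(0, sorted(self.immutable.items()))
            self.immutable = None

    def compact(self):
        """Merge every L0 run into L1; single-threaded, newest version wins."""
        merged = dict(self.l1)
        for run in reversed(self.l0):  # oldest first so newer runs overwrite
            merged.update(run)
        self.l1, self.l0 = sorted(merged.items()), []

    @staticmethod
    def _search(run, key):
        # Binary search in a sorted run of (key, value) tuples.
        i = bisect.bisect_left(run, (key,))
        if i < len(run) and run[i][0] == key:
            return True, run[i][1]
        return False, None

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        if self.immutable is not None and key in self.immutable:
            return self.immutable[key]
        for run in self.l0 + [self.l1]:  # newest data first
            found, value = self._search(run, key)
            if found:
                return value
        return None
```

A read checks the newest data first, so an overwritten key returns its latest value even while older versions still sit in lower runs; `compact` is where those older versions are finally dropped.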

Strengths: minimal API surface (Open, Put, Get, Delete, Iterator), a codebase small enough for one engineer to read in a week, and rock-solid stability after more than a decade. Weaknesses: no column families, no transactions, and no parallelism; the single compaction thread becomes the throughput ceiling. Bloom filter tuning is limited to a bits-per-key setting, and snapshots go no further than the basic GetSnapshot.

In maintenance mode at Google. New projects rarely choose LevelDB; existing dependents (Chrome, Bitcoin Core) keep it because porting away carries more risk than the ongoing maintenance costs.

RocksDB: The Server-Side Fork

Facebook's evolution toward production-grade embedded storage.

Forked from LevelDB in 2012 for Facebook's server-side storage workloads; it went on to back MyRocks (MySQL with a RocksDB storage engine) and ZippyDB (a distributed KV store). Major additions over LevelDB:

- multi-threaded compaction (max_background_jobs)
- column families with shared WAL
- transactions (TransactionDB, OptimisticTransactionDB, WritePrepared)
- tunable bloom filters (full, partitioned, ribbon, prefix)
- universal and FIFO compaction styles
- backup engine, checkpoint API
- statistics, perf context, IO tracing
- range deletes (DeleteRange)
- compression per level
- block cache with priorities
- write batch with index
- merge operators (counters, set, append)
- WriteBufferManager for cross-CF memtable budget
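Of the additions listed, merge operators are the least self-explanatory: a writer records an operand instead of doing a read-modify-write, and the operands are folded into a value on read or during compaction. The sketch below models only that idea in memory. `MergeStore` and `add_counter` are hypothetical names for illustration and do not mirror the RocksDB API.

```python
class MergeStore:
    """Toy model of merge operands that are folded together on read."""

    def __init__(self, merge_fn):
        self.merge_fn = merge_fn  # (existing_value_or_None, operand) -> value
        self.entries = {}         # key -> list of (kind, payload), oldest first

    def put(self, key, value):
        # A full put hides everything written before it.
        self.entries[key] = [("put", value)]

    def merge(self, key, operand):
        # No read happens here: the operand is simply appended.
        self.entries.setdefault(key, []).append(("merge", operand))

    def get(self, key):
        history = self.entries.get(key)
        if not history:
            return None
        value = None
        for kind, payload in history:
            value = payload if kind == "put" else self.merge_fn(value, payload)
        return value


def add_counter(existing, operand):
    """Counter-style merge function: a missing value counts as zero."""
    return (existing or 0) + operand
```

With `store = MergeStore(add_counter)`, three calls to `store.merge("hits", 1)` cost three appends, and the sum is only computed when `store.get("hits")` is called.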

The cost: ~250k LOC, hundreds of options, and a steep learning curve. Production deployments (TiKV, MyRocks, Yugabyte, ArangoDB) all invest in significant tuning work. Compared to LevelDB, you get a far more capable engine and far more rope to hang yourself with.

Pebble: Go-Native LSM

Cockroach Labs' from-scratch Go implementation, since 2018.

Pebble was started because Cgo overhead made calling RocksDB from Go untenable on hot paths. The Go reimplementation runs to ~80k lines of code. Its SST format was intentionally kept readable by RocksDB (with care; compatibility is not guaranteed) so that existing data and backups could be carried over.

Pebble's distinguishing features over RocksDB:

- Cgo-free; native Go for the entire stack
- DELRANGE optimized for Cockroach's MVCC-version-deletes
- value blocks (values stored apart from keys within the sstable; a lighter, optional form of the key/value separation that BlobDB provides)
- improved bulk ingestion path
- better-tuned default options for typical Cockroach workloads
- tighter integration with Go's runtime: profiling, scheduling, GC
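The range-delete item in the list above rests on one idea shared by RocksDB's DeleteRange and Pebble's DELRANGE: a single tombstone covering the half-open range [start, end) hides every older key inside it, with no per-key delete written. A minimal in-memory sketch of that visibility rule follows; `RangeDeleteStore` is a hypothetical name and the sequence-number bookkeeping is simplified.

```python
class RangeDeleteStore:
    """Toy model: point writes plus range tombstones ordered by sequence number."""

    def __init__(self):
        self.seq = 0
        self.points = {}            # key -> (seqnum, value)
        self.range_tombstones = []  # (seqnum, start, end), end exclusive

    def put(self, key, value):
        self.seq += 1
        self.points[key] = (self.seq, value)

    def delete_range(self, start, end):
        # One record, regardless of how many keys fall inside [start, end).
        self.seq += 1
        self.range_tombstones.append((self.seq, start, end))

    def get(self, key):
        entry = self.points.get(key)
        if entry is None:
            return None
        seq, value = entry
        for tomb_seq, start, end in self.range_tombstones:
            # A tombstone only hides writes that are older than itself.
            if start <= key < end and tomb_seq > seq:
                return None
        return value
```

Because visibility is decided by sequence number, a key written after the tombstone is readable again, which is the behavior an MVCC layer depends on when it drops old versions in bulk.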

What Pebble lacks: the full breadth of RocksDB's tuning surface (alternative compaction styles, some advanced options) and features that CockroachDB does not need, such as column families. Pebble targets CockroachDB's use case first; general use is a side benefit.

BadgerDB: The WiscKey Variant

Key-value separation: keys in LSM, values in a separate log.

BadgerDB (Dgraph, 2017) implements the WiscKey design: only keys go through the LSM tree; values are appended to a separate value log file. Each LSM entry stores a value pointer (log file, offset, and length) instead of the value itself.

Result: compaction rewrites only keys and pointers, which are small, so write amplification drops dramatically for workloads with large values. The cost: a Get whose value lives in the value log needs an extra read to fetch it, and the value log requires its own garbage collection (when keys are deleted or overwritten, the corresponding value-log entries become orphans).

RocksDB:        keys+values rewritten in compaction → high WA on big values
BadgerDB:       only keys rewritten → low WA, extra read per Get
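The key/value separation described in this section can be sketched in a few lines. This is a toy model, not BadgerDB's code: `ToyWiscKey` is a hypothetical name, a plain dict stands in for the LSM tree, and the value log is a single in-memory byte array rather than a set of files.

```python
class ToyWiscKey:
    """Toy WiscKey layout: keys map to pointers, values live in an append-only log."""

    def __init__(self):
        self.vlog = bytearray()  # append-only value log
        self.index = {}          # stand-in for the LSM tree: key -> (offset, length)

    def put(self, key, value: bytes):
        offset = len(self.vlog)
        self.vlog += value                      # value is written once, to the log
        self.index[key] = (offset, len(value))  # only the pointer enters the tree

    def get(self, key):
        pointer = self.index.get(key)
        if pointer is None:
            return None
        offset, length = pointer
        return bytes(self.vlog[offset:offset + length])  # the extra read

    def delete(self, key):
        self.index.pop(key, None)  # the value bytes stay behind as garbage

    def garbage_bytes(self):
        live = sum(length for _, length in self.index.values())
        return len(self.vlog) - live

    def gc(self):
        """Rewrite only live values into a fresh log and update the pointers."""
        new_log = bytearray()
        for key, (offset, length) in list(self.index.items()):
            self.index[key] = (len(new_log), length)
            new_log += self.vlog[offset:offset + length]
        self.vlog = new_log
```

Overwrites and deletes leave orphaned bytes in the log, which `garbage_bytes` reports and `gc` reclaims; that separate collection pass is the price paid for keeping values out of compaction.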

Where it wins: NoSQL-style workloads with values in the 1-100 KB range — big-value compaction in RocksDB amplifies WA dramatically; BadgerDB's keys-only LSM stays compact. Where it loses: small KV workloads where the value is in the same cache line as the key — RocksDB has no extra read, BadgerDB's pointer dereference becomes pure overhead.
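A rough calculation shows why value size decides the outcome. The sketch below compares the bytes a single compaction pass rewrites with values inline versus values replaced by pointers. The 16-byte key and 16-byte pointer sizes are assumptions chosen for illustration, and the function names are hypothetical.

```python
def bytes_rewritten(entries, key_size, value_size, pointer_size=None):
    """Bytes one compaction pass rewrites; pointer_size=None means values are inline."""
    payload = value_size if pointer_size is None else pointer_size
    return entries * (key_size + payload)


def separation_saving(entries, key_size, value_size, pointer_size=16):
    """Ratio of inline bytes to separated bytes for the same compaction pass."""
    inline = bytes_rewritten(entries, key_size, value_size)
    separated = bytes_rewritten(entries, key_size, value_size, pointer_size)
    return inline / separated


# Assumed sizes: 1M entries, 16-byte keys, 16-byte pointers.
print(separation_saving(1_000_000, 16, 16 * 1024))  # 16 KiB values -> 512.5
print(separation_saving(1_000_000, 16, 16))         # 16-byte values -> 1.0
```

Under these assumptions, separating 16 KiB values cuts the rewritten bytes by a factor of about 500, while separating 16-byte values saves nothing and still adds the pointer dereference on reads.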

Used by Dgraph, IPFS (as an optional datastore), and various Go-native projects where the workload fits.

Comparison Matrix

Feature	LevelDB	RocksDB	Pebble	BadgerDB
Language	C++	C++	Go	Go
Multi-threaded compaction	no	yes	yes	yes
Column families	no	yes	no	no
Transactions	no	yes	no (atomic batches only)	yes
Bloom filter tuning	fixed	extensive	good	good
K-V separation	no	BlobDB	value blocks	core design
Range deletes	no	yes	yes (excellent)	limited
Snapshots	basic	full	full	full
Code size	~20k LOC	~250k LOC	~80k LOC	~30k LOC
Major users	Chrome, Bitcoin	TiKV, MyRocks	Cockroach	Dgraph, IPFS

FAQ

Why did Facebook fork LevelDB into RocksDB?

LevelDB was designed as an embedded library for a single process (Chrome's IndexedDB is the canonical user), borrowing its structure from Bigtable: single-threaded compaction, no column families, basic tuning options. Facebook needed a server-side embedded engine with multi-threaded compaction, configurable bloom filters, column families, transactions, and tunable amplification. They forked in 2012, and the two codebases have diverged ever since. RocksDB now shares ~30% of the original LevelDB code; the rest has been rewritten.

Is BadgerDB just RocksDB in Go?

No. BadgerDB has a fundamentally different design: it separates keys from values, following the WiscKey paper. Keys live in an LSM tree; values live in a separate value log. This dramatically reduces write amplification (only keys and value pointers get rewritten during compaction) at the cost of a value-pointer dereference on Get. Excellent for workloads with large values; a poorer fit for small KV pairs, where the pointer is about as large as the value it replaces. Used by Dgraph, IPFS, and others where it fits.

What's the relationship between RocksDB and Pebble?

Pebble is a from-scratch Go reimplementation by Cockroach Labs, started in 2018. It mirrors RocksDB's LSM structure and SST format (so data can be moved between them with care) but adds Cockroach-specific extensions such as better range deletes, value blocks, and tighter MVCC integration. Pebble does not try to be 100% RocksDB-compatible; it diverges where Cockroach's needs differ. It is now used by CockroachDB and a growing list of Go projects.

Should I use LevelDB today?

Almost never for new projects. LevelDB is in maintenance mode at Google, with minimal new development. It is still embedded in Chrome (IndexedDB), Bitcoin Core (chainstate), and a few other long-lived consumers. For new code, RocksDB or one of the newer engines is the better choice: more features, more active development, broader tuning. Use LevelDB only if you specifically need its small footprint (~20k LOC vs RocksDB's ~250k) and minimal feature set.

Are SST files compatible across RocksDB, LevelDB, Pebble?

Largely no. RocksDB extended LevelDB's SST format with column-family info, larger metadata, and version markers; LevelDB cannot read RocksDB SSTs. Pebble can read RocksDB-format tables, but its own newer table formats carry a different magic number and version that RocksDB does not understand. External SST ingest (db->IngestExternalFile) only works for files in a format version the engine supports. Backups should always use the engine's own backup tool (BackupEngine, etc.), not raw SST copies between engines.
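One quick way to see which family a table file belongs to is the magic number stored in the last 8 bytes of its footer. The sketch below classifies a file tail on that basis. The two constants are quoted from memory of the LevelDB and RocksDB sources and should be verified against them before being relied on; `classify_sst_tail` is a hypothetical helper, and Pebble's own newer formats are deliberately left as "unknown".

```python
import struct

# Assumed values, recalled from the LevelDB and RocksDB source trees.
LEVELDB_MAGIC = 0xDB4775248B80FB57              # also RocksDB's legacy format
ROCKSDB_BLOCK_BASED_MAGIC = 0x88E241B785F4CFF7


def classify_sst_tail(data: bytes) -> str:
    """Classify a table file by the little-endian magic in its last 8 bytes."""
    if len(data) < 8:
        return "too short to be a table file"
    (magic,) = struct.unpack("<Q", data[-8:])
    if magic == LEVELDB_MAGIC:
        return "LevelDB-format table"
    if magic == ROCKSDB_BLOCK_BASED_MAGIC:
        return "RocksDB block-based table"
    return "unknown"
```

In practice the input would be the final bytes of an `.sst` or `.ldb` file, for example `classify_sst_tail(open(path, "rb").read()[-64:])`. A matching magic number only identifies the format family; it says nothing about whether a given engine version can read that file's format version.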

Why aren't there more LSM stores?

Because writing one is hard. The LSM kernel itself is ~5k LOC; the surrounding correctness machinery (recovery, snapshot semantics, atomic writes, compaction priorities, statistics, metrics) is 100k+ LOC of subtle code. Newer stores like Pebble took years of investment to reach a production-ready state. Most projects just wrap RocksDB. The exceptions (BadgerDB, Pebble) have specific reasons: language affinity, a fundamentally different design (WiscKey), or tighter integration with a higher-level system.