RocksDB Internals
RocksDB is not a database; it is an embeddable key-value storage engine. But it sits underneath an astonishing fraction of modern infrastructure: TiKV, CockroachDB (until it moved to Pebble, the default since v20.2), MyRocks, Kafka Streams' state stores, Apache Flink's RocksDB state backend, MongoRocks (a RocksDB storage engine for MongoDB, offered as an alternative to WiredTiger), and dozens of bespoke systems. Understanding its LSM tree, compaction model, and three-way amplification tradeoff is the closest thing to a universal language for write-optimized storage in 2026.
Based on the RocksDB 8.x source tree, the original BigTable LSM paper, and Mark Callaghan's MyRocks production notes.
The LSM Tree, End to End
Key Numbers
Why RocksDB Exists
Compaction Strategies
Leveled, universal, FIFO, and tiered compaction — how RocksDB trades write amplification for read amplification, and when each shape wins
LSM Tree Internals
MemTable, immutable MemTable, SST levels, and the journey of a single write from WAL to L6
Bloom Filters & Block Cache
Why every read touches a bloom filter, what fits in the block cache, and how partitioned filters keep negative lookups O(1)
Write & Read Amplification
The three-amplifier tradeoff (write, read, space) — why you can only optimize two, and which knob moves which
Column Families
Multiple logical keyspaces, one shared WAL — how to organise heterogeneous data without paying for separate databases
Transactions & Snapshots
Pessimistic vs optimistic transactions, sequence numbers as MVCC, and how a snapshot is just a frozen sequence number
Tuning for Workload
Write-heavy vs read-heavy vs space-constrained — the half-dozen options that actually matter and the dozen that don't
RocksDB vs LevelDB vs Pebble
The fork tree of LSM engines: what RocksDB added, what Pebble (Go) reimplemented, what BadgerDB does differently
The Write Path
Every Put, Delete, or Merge follows the same pipeline. The key/value
and a sequence number are first appended to the Write-Ahead Log. Whether that append is fsync'd
before the write returns depends on your WriteOptions::sync setting. With sync=false (the default)
the write only reaches the OS page cache, so a machine crash or power loss can lose the most recent
writes. With sync=true every write blocks on a physical fsync: durable, but you'll measure throughput
in single-digit thousands of ops/sec on a consumer SSD.
Once the WAL append returns, the entry is inserted into the active MemTable, by default
a concurrent skiplist sized at write_buffer_size (default 64 MB). When it fills, the engine
atomically swaps it to the immutable list and creates a fresh active MemTable. A background flush thread
then writes the immutable MemTable as a new L0 SSTable, building its bloom filter and index inline.
Once every MemTable that a WAL file covers has been flushed, that WAL file is no longer needed and is deleted or archived.
The whole write path is a sequential append plus an in-memory skiplist insert. There is no random I/O on the hot path. This is the single most important property of LSM trees: a write costs O(log M) in memory, where M is the number of entries in the MemTable, and one sequential append on disk, regardless of dataset size. The remaining cost is deferred to background flushes and compaction.
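The pipeline above can be sketched as a toy model. This is not RocksDB's API: `ToyLSM`, `memtable_limit` (an entry count standing in for write_buffer_size), and the explicit `flush()` call (standing in for the background flush thread) are all assumptions made for illustration.

```python
class ToyLSM:
    """Toy model of the write path: WAL append, MemTable insert, flush to L0."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # entry count; stands in for write_buffer_size
        self.seq = 0
        self.wal = []          # append-only log of (seq, key, value)
        self.memtable = {}     # active MemTable (a dict instead of a skiplist)
        self.immutables = []   # full MemTables waiting to be flushed
        self.l0 = []           # flushed tables, oldest first; each is a sorted list

    def put(self, key, value):
        self.seq += 1
        self.wal.append((self.seq, key, value))     # 1. sequential log append
        self.memtable[key] = (self.seq, value)      # 2. in-memory insert
        if len(self.memtable) >= self.memtable_limit:
            self.immutables.append(self.memtable)   # 3. swap in a fresh MemTable
            self.memtable = {}

    def flush(self):
        """Stand-in for the background flush thread."""
        while self.immutables:
            mem = self.immutables.pop(0)
            sst = sorted((k, s, v) for k, (s, v) in mem.items())
            self.l0.append(sst)
            flushed_up_to = max(s for _, s, _ in sst)
            # WAL entries covered by the flushed table are no longer needed.
            self.wal = [e for e in self.wal if e[0] > flushed_up_to]
```

After four puts with `memtable_limit=3`, the first three entries end up in one sorted L0 table, the fourth stays in the active MemTable, and only the fourth remains in the WAL.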
SSTables: Sorted, Immutable, Block-Indexed
An SSTable (Sorted String Table) is a single file with sorted key/value pairs grouped into blocks (default 4 KB), an index block holding one separator key and one block handle per data block, an optional filter block (bloom or ribbon), a metaindex, and a footer. Once written, an SST is never modified, only deleted by compaction. This immutability is what makes snapshots cheap and reads lock-free.
Reading a key from an SST: consult the filter first (when one is configured) to short-circuit if the key definitely isn't present, then binary search the index block to find the candidate data block, load that block (typically from the block cache, falling back to disk), and binary search within it. With the block cache warm and bloom filters enabled, a lookup for a key that is absent from the SST usually costs only the filter probe and zero disk I/O; a hit costs at most one data-block read.
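A minimal sketch of that lookup, under simplifying assumptions: blocks hold a fixed number of entries rather than 4 KB, the index stores the last key of each block as its separator, and an exact Python set stands in for the bloom filter (so it never produces false positives). The names `build_sst` and `sst_get` are hypothetical.

```python
import bisect

def build_sst(sorted_items, block_size=2):
    """Split sorted (key, value) pairs into blocks and index each block by its last key."""
    blocks = [sorted_items[i:i + block_size]
              for i in range(0, len(sorted_items), block_size)]
    index = [block[-1][0] for block in blocks]   # one separator key per block
    keys = {k for k, _ in sorted_items}          # exact set standing in for a bloom filter
    return {"blocks": blocks, "index": index, "filter": keys}

def sst_get(sst, key):
    """Return (value, blocks_read); blocks_read counts simulated data-block loads."""
    if key not in sst["filter"]:                 # filter says "definitely absent"
        return None, 0
    i = bisect.bisect_left(sst["index"], key)    # first block whose last key >= key
    if i == len(sst["blocks"]):
        return None, 0
    block = sst["blocks"][i]                     # would be a block-cache or disk read
    keys_in_block = [k for k, _ in block]
    j = bisect.bisect_left(keys_in_block, key)
    if j < len(block) and block[j][0] == key:
        return block[j][1], 1
    return None, 1                               # only reachable on a filter false positive
```

A present key costs one block load; an absent key is rejected by the filter before any block is touched.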
Compaction: The Heart of LSM Tradeoffs
L0 SSTables can have overlapping key ranges (each is a flushed MemTable). All other levels are non-overlapping: any key appears in at most one SST per level. Compaction rewrites SSTs to maintain this invariant, merge-sort style, dropping superseded versions and tombstones along the way.
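The merge step can be sketched as a k-way merge that keeps only the newest version of each key. This is a simplified illustration, not RocksDB's compaction iterator: it ignores snapshots, range deletions, and merge operands, and the `TOMBSTONE` marker and `bottommost` flag are assumptions of the sketch.

```python
import heapq

TOMBSTONE = object()  # marker for a deleted key

def compact(runs, bottommost=True):
    """Merge sorted runs of (key, seq, value), keeping the newest version of each key.

    Each run must be sorted by key, then by sequence number descending.
    Tombstones are dropped only when `bottommost` is true, because on higher
    levels they must survive to shadow older versions further down.
    """
    merged = heapq.merge(*runs, key=lambda e: (e[0], -e[1]))
    out, last_key = [], object()
    for key, seq, value in merged:
        if key == last_key:
            continue                  # superseded version
        last_key = key
        if value is TOMBSTONE and bottommost:
            continue                  # deleted key; nothing older can exist below
        out.append((key, seq, value))
    return out
```

Merging a newer run that overwrites `a` and deletes `b` with an older run containing `a`, `b`, and `c` yields only the new `a` and the untouched `c` at the bottommost level.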
Leveled compaction (the default): when level Lk exceeds its size budget (e.g., L1 = 100 MB, L2 = 1 GB, L3 = 10 GB…), the engine picks one SST from Lk and merges it with all overlapping SSTs in Lk+1. Because each level is about 10× the size of the one above it, most data lives in the last level and stale versions in the upper levels add only a small fraction on top, which bounds space amplification. The cost is high write amplification: merging into a level 10× larger can rewrite a key roughly 10 times per level as it descends.
Universal compaction: treat the data as a sequence of sorted runs whose key ranges may overlap and merge them in size-tiered batches (similar to Cassandra's STCS). Lower write amplification (each key is rewritten roughly log(N / MemTable size) times), higher space amplification (up to 2× the dataset while a large merge is in flight), and higher read amplification (a read may have to probe many overlapping runs). Universal wins for write-heavy, retention-bounded workloads.
FIFO compaction: just delete the oldest SST when total size exceeds a budget. No merging. Useful for time-series data where retention by age is the primary concern.
Tiered + leveled (hybrid): the upper levels (0 through N) are tiered, universal-style, and levels N+1 through the last are leveled.
Separately, RocksDB's level_compaction_dynamic_level_bytes=true refines plain leveled compaction: level targets are computed
backward from the actual size of the last level, so the last level absorbs ~90% of the data and earlier levels auto-size to keep amplification balanced.
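A rough sketch of why sizing levels backward from the last level leaves about 90% of the data there. It assumes a fixed 10× multiplier and seven levels (L0 through L6) and ignores the minimum base-level size that RocksDB also applies; `backward_level_targets` is a hypothetical helper, not a library function.

```python
def backward_level_targets(last_level_bytes, multiplier=10, num_levels=7):
    """Size levels L1..L(num_levels-1) backward from the last level."""
    targets = {}
    size = float(last_level_bytes)
    for level in range(num_levels - 1, 0, -1):
        targets[level] = size
        size /= multiplier
    return targets

targets = backward_level_targets(1_000_000_000_000)   # hypothetical 1 TB last level
share_in_last = targets[6] / sum(targets.values())    # fraction of data in L6
worst_case_space_amp = sum(targets.values()) / targets[6]
```

With these assumptions `share_in_last` comes out at about 0.90 and `worst_case_space_amp` at about 1.11, which is where the "~1.1" space-amplification figure for leveled compaction comes from: even if every byte in L1 through L5 is a stale duplicate of something in L6, the overhead is bounded by the geometric series 1/10 + 1/100 + ….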
The Three Amplifications
RocksDB tuning lives and dies by three numbers. Write amplification (WA): total bytes written to disk divided by bytes the user wrote. Leveled compaction with 10× ratios and 7 levels gives a worst-case WA of roughly 50–70×; measured steady-state values are often lower, commonly in the 10–30× range. Read amplification (RA): I/Os performed to serve one logical read. Bounded by the number of sorted runs probed (worst case every L0 file plus one SST per lower level; bloom filters drop this to ~1 in practice). Space amplification (SA): bytes on disk divided by live bytes. Leveled compaction keeps this under ~1.1; universal can swing to 2× during compaction.
The unkind theorem: you can pick any two. Lowering WA (universal compaction) raises SA. Lowering SA (leveled) raises WA. Lowering RA (more bloom bits, larger block cache, fewer levels) raises memory cost. Workload tuning is choosing your two.
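The write-amplification figure for leveled compaction can be sanity-checked with back-of-the-envelope arithmetic. The model below is an assumption-laden upper bound, not a measurement: it charges one write for the WAL, one for the flush, one for L0→L1, and a full `fanout` rewrites for every deeper level transition.

```python
def worst_case_leveled_wa(num_levels=7, fanout=10):
    """Crude upper bound on write amplification for leveled compaction."""
    wal = 1                               # every byte is written to the WAL once
    flush = 1                             # and once more when the MemTable becomes an L0 file
    l0_to_l1 = 1                          # assume L0 and L1 are similar in size
    deeper = fanout * (num_levels - 2)    # L1->L2 ... L5->L6, up to `fanout` rewrites each
    return wal + flush + l0_to_l1 + deeper

print(worst_case_leveled_wa())            # 53
```

Real workloads rarely hit this bound, since skewed keys, overwrites that die in upper levels, and dynamic level sizing all reduce the number of rewrites.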
Bloom Filters and the Block Cache
A bloom filter for each SST occupies ~10 bits per key (configurable) and answers "is this key definitely not in the file?" with a handful of memory probes. The false positive rate is about 1% at 10 bits/key. Without bloom filters, a point read traverses every level until it finds the key or runs out, which is devastating on cold caches. With them, a read typically pays one in-memory filter probe per SST it considers and reads data blocks only from the file that actually holds the key, plus the occasional false positive.
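The ~1% figure follows from the textbook bloom filter formula. The sketch below uses that classic formula and a plain double-hashing filter built on SHA-256; RocksDB's actual filter implementations (cache-local bloom, ribbon) differ in layout and hashing, so `bloom_fpr` and `ToyBloom` are illustrative assumptions only.

```python
import hashlib
import math

def bloom_fpr(bits_per_key, num_probes=None):
    """Classic false-positive estimate: (1 - e^(-k/b))^k for k probes, b bits per key."""
    if num_probes is None:
        num_probes = max(1, round(bits_per_key * math.log(2)))
    return (1 - math.exp(-num_probes / bits_per_key)) ** num_probes

class ToyBloom:
    def __init__(self, num_keys, bits_per_key=10):
        self.m = max(1, num_keys * bits_per_key)                 # total bits
        self.k = max(1, round(bits_per_key * math.log(2)))       # probes per key
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1             # odd step for double hashing
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def may_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))
```

At 10 bits per key the formula picks 7 probes and predicts a false positive rate a little under 1%. The filter never returns a false negative, which is what makes it safe to skip an SST when `may_contain` is false.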
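The block cache (an LRU cache by default, and a small one: 8 MB historically, raised to 32 MB in recent releases; set this much higher in production) caches uncompressed data blocks. For working sets larger than RAM but smaller than disk, RocksDB has also offered a compressed block cache (superseded in recent releases by a compressed secondary cache), along with partitioned indexes/filters that page in on demand rather than residing entirely in RAM.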
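To make the caching behaviour concrete, here is a minimal byte-budgeted LRU cache. It is a sketch of the general technique, not RocksDB's sharded LRUCache; `ToyBlockCache`, the `(file, offset)` handle, and the `load_block` callback are hypothetical.

```python
from collections import OrderedDict

class ToyBlockCache:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.blocks = OrderedDict()   # (file, offset) -> bytes, least recently used first
        self.hits = 0
        self.misses = 0

    def get(self, handle, load_block):
        """Return the block for `handle`, calling load_block(handle) on a miss."""
        if handle in self.blocks:
            self.blocks.move_to_end(handle)       # mark as most recently used
            self.hits += 1
            return self.blocks[handle]
        self.misses += 1
        data = load_block(handle)                 # simulated disk read
        self.blocks[handle] = data
        self.used += len(data)
        while self.used > self.capacity and self.blocks:
            _, evicted = self.blocks.popitem(last=False)   # evict least recently used
            self.used -= len(evicted)
        return data
```

With an 8 KB budget and 4 KB blocks, touching a third distinct block evicts whichever of the first two was used least recently.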
Column Families: Logical Keyspaces, One WAL
Column families are independent keyspaces sharing a single WAL and database directory. Each has its own
MemTable, SSTs, and compaction settings. This is how RocksDB-based systems separate, say, the data
keyspace from the secondary index keyspace from the metadata keyspace — different compaction policies,
different bloom configs, but atomic cross-CF writes via the shared WAL and a multi-CF
WriteBatch.
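MyRocks lets tables, indexes, and partitions be assigned to their own CFs. TiKV uses CFs (default, write, lock) to keep values, MVCC commit records, and locks apart. Kafka Streams, by contrast, mostly opens a separate RocksDB instance per state store partition. The pattern is everywhere.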
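The atomicity of cross-CF writes comes from logging the whole batch as one WAL record before touching any MemTable. A toy sketch of that idea follows; `ToyDB` and the `(cf, key, value)` batch shape are assumptions for illustration and do not mirror the real WriteBatch API.

```python
class ToyDB:
    def __init__(self, column_families):
        self.wal = []                                     # one log shared by every CF
        self.memtables = {cf: {} for cf in column_families}
        self.seq = 0

    def write(self, batch):
        """Apply a list of (cf, key, value) operations atomically."""
        for cf, _, _ in batch:                            # validate before changing anything
            if cf not in self.memtables:
                raise KeyError(f"unknown column family: {cf}")
        first_seq = self.seq + 1
        self.wal.append((first_seq, list(batch)))         # one WAL record for the whole batch
        for cf, key, value in batch:
            self.seq += 1
            self.memtables[cf][key] = (self.seq, value)   # each CF has its own MemTable
```

Because recovery replays whole WAL records, a crash leaves either both the data row and its index entry or neither, even though they live in different column families.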
Transactions: Optimistic, Pessimistic, Snapshots
RocksDB exposes two transaction implementations on top of its sequence-number-based MVCC.
Pessimistic (TransactionDB) takes locks on keys at Put time, conflicting with
other transactions until commit or rollback. Optimistic (OptimisticTransactionDB) takes
no locks but verifies at commit time that none of the keys it wrote (or read via GetForUpdate) were modified by another writer since the transaction began; on conflict the commit fails and the application retries.
Snapshots are essentially a frozen sequence number: reads see only entries with seqnum ≤ that snapshot.
Snapshots are O(1) to take and free until they hold back compaction (preventing old versions and tombstones from being collected).
Long-held snapshots are a classic RocksDB foot-gun: they prevent compaction from dropping superseded versions, growing space amplification arbitrarily. RocksDB does not expire snapshots on its own, so release them promptly and enforce a maximum snapshot lifetime in the application.
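The "snapshot is a frozen sequence number" idea fits in a few lines. The sketch below is a toy MVCC store, not RocksDB's implementation; `ToyMVCC` keeps every version in memory and has no compaction, so it only illustrates visibility.

```python
class ToyMVCC:
    def __init__(self):
        self.seq = 0
        self.versions = {}   # key -> list of (seq, value), oldest first

    def put(self, key, value):
        self.seq += 1
        self.versions.setdefault(key, []).append((self.seq, value))

    def snapshot(self):
        return self.seq      # O(1): a snapshot is just the current sequence number

    def get(self, key, snapshot=None):
        limit = self.seq if snapshot is None else snapshot
        for seq, value in reversed(self.versions.get(key, [])):
            if seq <= limit:             # newest version visible to this reader
                return value
        return None
```

A reader holding a snapshot keeps seeing the old value after a later overwrite. The same mechanism explains the foot-gun: as long as the snapshot is alive, the version it can see must not be garbage-collected.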
Merge Operators
A unique RocksDB feature: register a function that combines a base value with a sequence of "merge" writes. This lets you express counters, append-to-list, JSON-patch-like updates without a read-modify-write cycle. The merge operator runs lazily — at compaction or read time — when the engine encounters a stack of merges. Critical for high-throughput counter workloads where read-modify-write would serialize.
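A toy model of lazy merging, assuming a simple associative operator. `ToyMergeStore` and `add_counter` are hypothetical names; the real interface is a C++ MergeOperator registered in the options, with more cases (partial merges, failures) than this sketch covers.

```python
class ToyMergeStore:
    def __init__(self, merge_fn):
        self.merge_fn = merge_fn
        self.entries = {}    # key -> (base_value, [pending operands])

    def put(self, key, value):
        self.entries[key] = (value, [])

    def merge(self, key, operand):
        # Only appends the operand; the base value is never decoded or combined here.
        base, operands = self.entries.get(key, (None, []))
        self.entries[key] = (base, operands + [operand])

    def get(self, key):
        base, operands = self.entries.get(key, (None, []))
        for operand in operands:         # operands are folded in at read time
            base = self.merge_fn(base, operand)
        return base

    def compact(self):
        # Compaction collapses each operand stack into a single base value.
        for key in list(self.entries):
            self.entries[key] = (self.get(key), [])

def add_counter(existing, operand):
    return (existing or 0) + operand
```

Incrementing a counter is then a blind `merge(key, 1)`; the additions happen when the key is read or compacted, so writers do not perform a read-modify-write cycle.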
Tradeoffs and When Not To Use RocksDB
RocksDB is hard to operate. Its 200+ tuning knobs hide a small number that genuinely matter behind a large number that look like they do. Compaction stalls, write stalls, and amplification surprises are real. The library has no replication, no SQL, and no transactions across processes, and its backup support (BackupEngine, checkpoints) gives you primitives rather than a managed backup story. It is a component.
Use it when you're building a database or a storage-heavy service and need the LSM tradeoff (high write throughput, sequential ingestion, datasets larger than RAM). Don't use it when you'd be better served by a real database (PostgreSQL, MySQL) with a proven operations story, or by a managed store (DynamoDB, Spanner) where you pay for the operational complexity to be someone else's problem.
RocksDB vs Other LSM Engines
| | RocksDB | LevelDB | Pebble (Go) | BadgerDB (Go) |
|---|---|---|---|---|
| Origin | Facebook fork of LevelDB | Google (Jeff Dean and Sanjay Ghemawat) | Cockroach Labs | Dgraph Labs |
| Language | C++ | C++ | Go (CGO-free) | Go |
| Threading | Multi-threaded compaction | Single-threaded | Multi-threaded | Multi-threaded |
| Compaction strategies | Leveled, universal, FIFO | Leveled only | Leveled (RocksDB-compatible) | Leveled (key/value separation) |
| Column families | Yes | No | No (different model) | Streams API |
| Transactions | Pessimistic + optimistic | No | Snapshot batches | SSI transactions |
| Key/value separation | BlobDB (optional) | No | Built-in (per-key threshold) | WiscKey (default) |
| Used by | MyRocks, TiKV, Flink, Kafka Streams | Chrome (IndexedDB), Bitcoin Core | CockroachDB (default since v20.2) | Dgraph, Hypermode |
FAQ
Why does RocksDB need so many compaction threads?
Compaction is essentially a background merge sort that reads and writes huge volumes of data. With leveled compaction at 10× ratios, sustained writes can generate up to ~50× their volume in compaction I/O. A single thread can't keep up: the LSM accumulates L0 files, write stalls trigger, and tail latency explodes. Setting max_background_jobs (which supersedes the deprecated max_background_compactions) to 4–8 is a typical baseline for write-heavy workloads.
What is a "write stall" and how do I avoid it?
RocksDB throttles or stops writes when (a) the active MemTable + N immutable MemTables are all full and the flush thread can't keep up, or (b) L0 has too many files (default ≥20 triggers slowdown, ≥36 stops writes). Avoid by sizing max_write_buffer_number, level0_slowdown_writes_trigger, and level0_stop_writes_trigger generously, and by giving compaction enough threads.
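The two conditions can be written down as a small decision function. This is a simplified sketch: the L0 thresholds are the defaults quoted above, the default of 2 for max_write_buffer_number is an assumption of the sketch, and real RocksDB also stalls on other signals such as pending compaction bytes. `write_stall_state` is a hypothetical helper, not a library call.

```python
def write_stall_state(l0_files, full_memtables, max_write_buffer_number=2,
                      l0_slowdown_trigger=20, l0_stop_trigger=36):
    """Return "stop", "slowdown", or "normal" for a simplified stall model."""
    if full_memtables >= max_write_buffer_number:
        return "stop"        # (a) no MemTable left to write into; flush is behind
    if l0_files >= l0_stop_trigger:
        return "stop"        # (b) too many L0 files; compaction is far behind
    if l0_files >= l0_slowdown_trigger:
        return "slowdown"    # (b) writes are throttled to let compaction catch up
    return "normal"
```

Raising the triggers or the MemTable count buys headroom for bursts, but if compaction throughput is below the sustained write rate the stall only arrives later.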
How do I pick a compaction strategy?
Start with leveled compaction with level_compaction_dynamic_level_bytes=true. Switch to universal only if your workload is genuinely write-bound and you can tolerate ~2× space amplification during compactions. Use FIFO for time-series with hard retention. Almost everyone should start leveled.
Should I use BlobDB / key-value separation?
Yes, if your values are large (≥1 KB) and the workload is write-heavy. Storing values in separate blob files cuts compaction cost dramatically, because compaction then rewrites only keys and pointers, not the values themselves. The tradeoff is that a point lookup does an extra blob file read per hit, and reclaiming space from dead values needs blob garbage collection. For small values it's a loss; for large ones it's transformative.
How does TiKV / CockroachDB / MyRocks differ from raw RocksDB?
They use RocksDB (or Pebble) as the storage engine and add a Raft consensus layer, SQL parsing, transactions across nodes, schema, and operations tooling. RocksDB itself has none of those. Saying "TiKV is RocksDB" is like saying "PostgreSQL is just a heap file" — technically the truth and entirely missing the point.
Why did CockroachDB switch to Pebble?
RocksDB is C++; CockroachDB is Go. Calling RocksDB from Go via CGO has nontrivial overhead (GC interaction, stack management). Pebble is a Go-native reimplementation, RocksDB-format-compatible, that eliminates the CGO boundary. The same arguments don't apply to Java, C++, or Rust ecosystems where RocksDB remains dominant.
What's the relationship between RocksDB and the TigerBeetle / FoundationDB / Spanner engines?
Little to none. TigerBeetle implements its own storage engine, an LSM-based design tailored to financial workloads. FoundationDB's classic ssd engine is built on SQLite's B-tree; newer releases add the Redwood engine and an optional RocksDB-backed engine. Spanner uses a custom Bigtable-derived storage layer over Colossus. The LSM idea is widely shared; the implementations have diverged radically.