RocksDB Internals

RocksDB is not a database; it is an embeddable key-value storage engine. But it sits underneath a large fraction of modern infrastructure: TiKV, CockroachDB (until Pebble replaced it as the default engine in v20.2), MyRocks, Kafka Streams' state stores, Apache Flink's RocksDB state backend, MongoRocks (a RocksDB-backed alternative to MongoDB's WiredTiger engine), and dozens of bespoke systems. Understanding its LSM tree, compaction model, and three-way amplification tradeoff is the closest thing to a universal language for write-optimized storage in 2026.

Based on the RocksDB 8.x source tree, the original BigTable LSM paper, and Mark Callaghan's MyRocks production notes.

The LSM Tree, End to End

Figure: the LSM tree, end to end.
Write path: Put(k,v) -> WAL (append + fsync) -> MemTable (skiplist, in RAM) -> immutable MemTable (awaiting flush) -> L0 SSTable (flush output).
On-disk levels (leveled compaction, 10x size ratio): L0, overlapping; L1, sorted, ~100 MB; L2, sorted, ~1 GB; L3, sorted, ~10 GB; L4, sorted, ~100 GB; L5/L6, sorted, TB-scale, ~90% of total bytes.
Read path: 1. MemTable + immutable MemTables; 2. L0 (each file, newest first); 3. L1..Ln (binary search). Bloom filters reject most levels without I/O.

Key Numbers

Default block size: 4 KB
Default level multiplier: 10x
Levels (typical): 7 (L0–L6)
Bloom FPR: ~1%
MemTable default: 64 MB
SST max size: 64 MB (L1+)
Max open files: configurable

Why RocksDB Exists

The Gap
B-trees were designed for spinning disks where seeks dominated. On flash, random writes wear out cells and write performance drops as the device fills. LevelDB demonstrated LSM trees as a flash-friendly alternative but was a single-threaded library missing tooling.
The Insight
If you batch random writes into sequential I/O via a log + memtable + sorted files, you put the flash device's strongest property (sequential bandwidth) on your write path's hot path. Compaction cleans up later, in the background, where latency budgets are looser.
The Result
Facebook forked LevelDB in 2012, made it multi-threaded, and added compaction strategies, column families, transactions, and merge operators. RocksDB is now the de facto embedded engine, from MyRocks to Kafka Streams, because it can be tuned for almost any workload shape.

The Write Path

Every Put, Delete, or Merge follows the same pipeline. The key/value and a sequence number are first appended to the Write-Ahead Log. Whether that append is fsync'd before the write returns is controlled by WriteOptions::sync, which defaults to false. With sync=false the record sits in the OS page cache: it survives a process crash, but a machine crash or power loss can drop the most recent writes. With sync=true every write blocks on a physical fsync: durable, but throughput on a consumer SSD typically falls to the low thousands of ops/sec unless writes are batched.

Once the WAL append returns, the entry is inserted into the active MemTable, by default a concurrent skiplist sized by write_buffer_size (default 64 MB). When it fills, the engine atomically moves it to the immutable list and creates a fresh active MemTable. A background flush thread then writes the immutable MemTable as a new L0 SSTable, building its index and (if configured) its bloom filter inline. Once every MemTable that a WAL file covers has been flushed, that WAL file is obsolete and is deleted or archived.

The foreground write path is a sequential append plus an in-memory skiplist insert. There is no random I/O on the hot path. This is the single most important property of LSM trees: each write costs O(log M) in memory, where M is the number of entries in the MemTable, plus one sequential append on disk, regardless of dataset size. The rest of the cost is deferred to flush and compaction.
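
The pipeline above can be sketched as a toy model. This is an illustration only, not RocksDB code: the class name ToyLSM, the entry-count limit standing in for write_buffer_size, and the dict standing in for the skiplist are all assumptions made for the example.

```python
class ToyLSM:
    """Toy write path: WAL append -> active MemTable -> immutable list -> L0 flush."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # entry count, standing in for write_buffer_size
        self.seq = 0
        self.wal = []        # append-only list standing in for the log file
        self.active = {}     # a dict stands in for the skiplist
        self.immutable = []  # full MemTables waiting for the flush thread
        self.l0 = []         # flushed sorted runs, oldest first

    def put(self, key, value):
        self.seq += 1
        self.wal.append((self.seq, key, value))  # 1. sequential log append
        self.active[key] = (self.seq, value)     # 2. in-memory insert
        if len(self.active) >= self.memtable_limit:
            self.immutable.append(self.active)   # 3. swap in a fresh MemTable
            self.active = {}

    def flush(self):
        """Background step: write each immutable MemTable as one sorted L0 run."""
        while self.immutable:
            mem = self.immutable.pop(0)
            run = sorted((k, s, v) for k, (s, v) in mem.items())
            self.l0.append(run)
            flushed_upto = max(s for _, s, _ in run)
            # Log entries covered by the flush are no longer needed for recovery.
            self.wal = [e for e in self.wal if e[0] > flushed_upto]
```

Note that put never sorts or rewrites anything on "disk"; all ordering work happens in flush, off the foreground path.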

SSTables: Sorted, Immutable, Block-Indexed

An SSTable (Sorted String Table) is a single file with sorted key/value pairs grouped into data blocks (default 4 KB), an index block holding one separator key and block offset per data block, an optional filter block (bloom or ribbon), a metaindex block, and a footer. Once written, an SST is never modified, only deleted after compaction has rewritten its contents. This immutability is what makes snapshots cheap and reads lock-free.

Reading a key from an SST: first consult the filter, if one is configured, and skip the file when it says the key is definitely absent. Otherwise binary search the index block to find the candidate data block, load that block (from the block cache if possible, falling back to disk), and binary search within it. With the filter and index resident in memory, a lookup for a key the file does not hold usually costs a few memory accesses and no disk I/O; a lookup that hits costs at most one data block read.
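
A minimal sketch of that lookup, assuming a toy in-memory layout rather than the real file format. The names build_sst and sst_get are hypothetical, blocks hold two entries instead of 4 KB, and an exact Python set stands in for the bloom filter (so this toy filter never gives a false positive).

```python
import bisect

def build_sst(sorted_items, block_size=2):
    """Split sorted (key, value) pairs into blocks; index each block by its last key."""
    blocks = [sorted_items[i:i + block_size]
              for i in range(0, len(sorted_items), block_size)]
    index = [block[-1][0] for block in blocks]      # one separator key per block
    key_filter = {key for key, _ in sorted_items}   # stand-in for a bloom filter
    return {"blocks": blocks, "index": index, "filter": key_filter}

def sst_get(sst, key, stats):
    """Point lookup: filter, then index binary search, then in-block binary search."""
    if key not in sst["filter"]:
        return None                                 # rejected without touching a block
    i = bisect.bisect_left(sst["index"], key)       # first block whose last key >= key
    if i == len(sst["blocks"]):
        return None                                 # key sorts after the whole file
    stats["block_reads"] += 1                       # the only step that may hit disk
    block = sst["blocks"][i]
    keys = [k for k, _ in block]
    j = bisect.bisect_left(keys, key)
    if j < len(keys) and keys[j] == key:
        return block[j][1]
    return None                                     # reachable only after a filter false positive
```

The stats counter makes the cost visible: a rejected lookup leaves block_reads unchanged, and a hit increments it exactly once.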

Compaction: The Heart of LSM Tradeoffs

L0 SSTables can have overlapping key ranges (each is a flushed MemTable). All other levels are non-overlapping: any key appears in at most one SST per level. Compaction rewrites SSTs to maintain this invariant, merge-sort style, dropping superseded versions that no snapshot still needs. Tombstones are dropped only once they reach the last level that could hold an older version of the key.
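
The merge step can be sketched as a k-way merge over sorted runs. This is a simplified illustration under stated assumptions: entries are (key, seq, value) tuples, each run is sorted by key with one version per key, snapshots are ignored, and the names compact and TOMBSTONE are hypothetical.

```python
import heapq

TOMBSTONE = object()  # marker value for a delete

def compact(runs, bottommost=False):
    """Merge sorted runs of (key, seq, value), keeping only the newest version per key.

    Tombstones are kept unless this is the bottommost level, because an older
    version of the key might still exist in a deeper level.
    """
    # Order by key, then by descending sequence number so the newest version comes first.
    merged = heapq.merge(*runs, key=lambda e: (e[0], -e[1]))
    out = []
    last_key = object()  # sentinel that equals no real key
    for key, seq, value in merged:
        if key == last_key:
            continue      # superseded older version: drop it
        last_key = key
        if value is TOMBSTONE and bottommost:
            continue      # nothing older can exist below, so the delete is done
        out.append((key, seq, value))
    return out
```

A real compaction also has to keep any version that an open snapshot can still see, which this sketch deliberately leaves out.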

Leveled compaction (the default): when level Lk exceeds its size budget (e.g., L1 = 100 MB, L2 = 1 GB, L3 = 10 GB…), the engine picks one SST from Lk and merges it with all overlapping SSTs in Lk+1. This bounds space amplification: with a 10× ratio the last level holds about 90% of the data, so stale versions in the upper levels can add only about 11%. The cost is high write amplification: each move from Lk to Lk+1 can rewrite up to roughly 10 bytes of Lk+1 for every byte pushed down, and a key pays that cost again at every level it descends.
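
As a worked example, the size budgets and the overlap selection can be written down directly. This sketch uses the article's illustrative 100 MB base and 10× multiplier, not values read from any configuration; level_targets and overlapping are hypothetical helper names, and files are reduced to (smallest_key, largest_key) pairs.

```python
def level_targets(base_mb, multiplier, num_levels):
    """Target size in MB for L1..L(num_levels-1); L0 is governed by file count instead."""
    return {level: base_mb * multiplier ** (level - 1)
            for level in range(1, num_levels)}

def overlapping(picked, next_level):
    """Files in the next level whose key range intersects the picked file's range."""
    lo, hi = picked
    return [f for f in next_level if f[0] <= hi and lo <= f[1]]

targets = level_targets(base_mb=100, multiplier=10, num_levels=7)
last_level_share = targets[6] / sum(targets.values())  # fraction of bytes in L6
```

With these inputs the last level's share comes out at about 0.9, which is where the "~90% of total bytes" figure for the bottom level comes from.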

Universal compaction: organize the data as a set of sorted runs whose key ranges may overlap, and merge runs of similar size in tiered batches (similar to Cassandra's STCS). Lower write amplification (each key is rewritten roughly log(N/MemTable) times, where N is the dataset size), higher space amplification (up to 2× the dataset), and higher read amplification (a read may have to probe many overlapping runs). Universal wins for write-heavy, retention-bounded workloads.

FIFO compaction: just delete the oldest SST when total size exceeds a budget. No merging. Useful for time-series data where retention by age is the primary concern.
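
FIFO's policy is small enough to sketch in full. This is a toy model, assuming files are (name, size) pairs held oldest-first in a deque; fifo_compact and the file names are hypothetical.

```python
from collections import deque

def fifo_compact(files, max_total_bytes):
    """Drop the oldest files until the total size fits the budget.

    `files` is a deque of (name, size_bytes), oldest first; it is modified in
    place. Returns the names of the dropped files. Nothing is merged or rewritten.
    """
    dropped = []
    total = sum(size for _, size in files)
    while files and total > max_total_bytes:
        name, size = files.popleft()
        total -= size
        dropped.append(name)
    return dropped
```

Because nothing is rewritten, write amplification from compaction is zero; the price is that data simply disappears once it ages past the budget.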

Tiered + leveled (hybrid): levels 0 to N are tiered (universal-style) and levels N+1 to the last are leveled. A related but distinct option is level_compaction_dynamic_level_bytes=true, which keeps compaction leveled but sizes the levels from the last level upward, so the last level holds ~90% of the data and earlier levels auto-size to keep amplification balanced.

The Three Amplifications

RocksDB tuning lives and dies by three numbers.

Write amplification (WA): total bytes written to disk divided by bytes the user wrote. Leveled compaction with 10× ratios and 7 levels has a worst case of roughly 50–70×; measured steady-state values are often lower, depending on key distribution and how full each level is.

Read amplification (RA): I/Os performed to serve one logical read. Bounded by the number of sorted runs probed (worst case every L0 file plus one file per deeper level; bloom filters drop this to ~1 in practice).

Space amplification (SA): bytes on disk divided by live bytes. Leveled compaction keeps this around 1.1; universal can swing to 2× during compaction.
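
The leveled figures can be reproduced with back-of-the-envelope arithmetic. The model below is a deliberately crude upper bound, not a measurement: it assumes one write for the WAL, one for the flush, and a full multiplier's worth of rewriting at every transition between sorted levels. The function names are hypothetical.

```python
def space_amp_bound(multiplier, levels_above_last):
    """Worst case: every level above the last holds only stale copies of last-level keys."""
    return 1 + sum(multiplier ** -i for i in range(1, levels_above_last + 1))

def leveled_write_amp_upper(multiplier, sorted_levels, wal=1, flush=1):
    """Crude upper bound: WAL + flush + `multiplier` rewrites per level transition."""
    transitions = sorted_levels - 1  # L1->L2 ... L(n-1)->Ln
    return wal + flush + multiplier * transitions

sa = space_amp_bound(multiplier=10, levels_above_last=5)        # L1..L5 above L6
wa = leveled_write_amp_upper(multiplier=10, sorted_levels=6)    # L1..L6
```

Here sa evaluates to about 1.11 and wa to 52, before counting the L0 to L1 merge, which is how worst-case estimates land in the 50–70× range.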

The unkind theorem: you can pick any two. Lowering WA (universal compaction) raises SA. Lowering SA (leveled) raises WA. Lowering RA (more bloom bits, larger block cache, fewer levels) raises memory cost. Workload tuning is choosing your two.

Bloom Filters and the Block Cache

A bloom filter for each SST occupies ~10 bits per key (configurable) and answers "is this key definitely not in the file?" with a handful of bit probes in memory. The false positive rate is about 1% at 10 bits/key. Filters are opt-in: they are built only when a filter policy is set in the table options. Without them, a point read must search every L0 file and one file per deeper level until it finds the key or runs out, which is devastating on cold caches. With them, most files that do not hold the key are rejected by an in-memory probe, with no data block read.
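
The ~1% figure follows from the textbook bloom filter formula. The sketch below is a generic bloom filter, not RocksDB's filter implementation: ToyBloom and theoretical_fpr are hypothetical names, and SHA-256 with double hashing is an assumption chosen for clarity rather than speed.

```python
import hashlib
import math

def theoretical_fpr(bits_per_key):
    """Textbook false positive rate with the optimal number of hash probes."""
    k = max(1, round(bits_per_key * math.log(2)))
    return (1 - math.exp(-k / bits_per_key)) ** k

class ToyBloom:
    def __init__(self, num_keys, bits_per_key=10):
        self.m = max(1, num_keys * bits_per_key)              # total bits
        self.k = max(1, round(bits_per_key * math.log(2)))    # probes per key
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1       # odd step for double hashing
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def may_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))
```

The asymmetry is the whole point: a False answer is always correct, so a negative lookup can skip the file entirely, while a True answer only means the data block has to be checked.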

The block cache (default 8 MB LRU; set this much higher in production) caches uncompressed data blocks. RocksDB also offers a compressed block cache for working sets larger than RAM but smaller than disk, and partitioned indexes/filters that page in on demand rather than residing entirely in RAM.
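
The LRU policy can be sketched with an ordered dictionary. This is a toy model, assuming blocks are byte strings keyed by an opaque id; ToyBlockCache and the load_block callback are hypothetical names, and the real cache is sharded and thread-safe, which this is not.

```python
from collections import OrderedDict

class ToyBlockCache:
    """Byte-budgeted LRU cache of uncompressed blocks."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.blocks = OrderedDict()  # block_id -> bytes, least recently used first
        self.hits = 0
        self.misses = 0

    def get(self, block_id, load_block):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)   # mark as most recently used
            self.hits += 1
            return self.blocks[block_id]
        self.misses += 1
        data = load_block(block_id)             # stands in for a disk read + decompress
        self.blocks[block_id] = data
        self.used += len(data)
        while self.used > self.capacity and self.blocks:
            _, evicted = self.blocks.popitem(last=False)  # evict least recently used
            self.used -= len(evicted)
        return data
```

Watching the hits and misses counters against a realistic access pattern is a quick way to see why a cache far smaller than the hot working set does little good.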

Column Families: Logical Keyspaces, One WAL

Column families are independent keyspaces sharing a single WAL and database directory. Each has its own MemTable, SSTs, and compaction settings. This is how RocksDB-based systems separate, say, the data keyspace from the secondary index keyspace from the metadata keyspace — different compaction policies, different bloom configs, but atomic cross-CF writes via the shared WAL and a multi-CF WriteBatch.
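
Why a shared WAL gives atomic cross-CF writes can be shown with a toy model. This is a sketch under stated assumptions, not the RocksDB API: ToyDB, its write method, and replay are hypothetical, and a batch is just a list of (cf_name, key, value) tuples.

```python
class ToyDB:
    def __init__(self, cf_names):
        self.wal = []                                      # one log shared by all CFs
        self.memtables = {name: {} for name in cf_names}   # one MemTable per CF

    def write(self, batch):
        """Log the whole batch as a single WAL record, then apply it to each CF."""
        for cf, _, _ in batch:
            if cf not in self.memtables:
                raise KeyError(cf)        # reject before logging: nothing is half-applied
        self.wal.append(list(batch))      # one record spans every CF in the batch
        for cf, key, value in batch:
            self.memtables[cf][key] = value

def replay(wal, cf_names):
    """Crash recovery: rebuild all MemTables from the shared log, record by record."""
    db = ToyDB(cf_names)
    for record in wal:
        db.write(record)
    return db
```

Because recovery replays whole records, a batch that touches a data CF and an index CF is either fully present after a crash or fully absent.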

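MyRocks lets each index be assigned to a column family. TiKV's key-value engine uses separate CFs (default, write, lock) for the pieces of its MVCC transaction data. Kafka Streams opens a RocksDB instance per state store partition and uses an extra CF in its timestamped stores. The pattern is everywhere.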

Transactions: Optimistic, Pessimistic, Snapshots

RocksDB exposes two transaction implementations on top of its sequence-number-based MVCC. Pessimistic (TransactionDB) takes locks on keys as they are written (or read with GetForUpdate), making conflicting transactions wait until commit or rollback. Optimistic (OptimisticTransactionDB) takes no locks but checks at commit time that none of the keys it tracked were written by another transaction in the meantime; on conflict the commit fails and the caller retries. Snapshots are essentially a frozen sequence number: reads see only entries with seqnum ≤ that snapshot. Taking one is O(1), and it costs nothing until it starts holding back compaction, which must keep every version the snapshot can still see.
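
The "frozen sequence number" idea fits in a few lines. This is a minimal sketch, assuming a toy store that keeps every version of every key in memory; ToyMVCC and its methods are hypothetical names, not RocksDB calls.

```python
class ToyMVCC:
    def __init__(self):
        self.seq = 0
        self.versions = {}  # key -> list of (seq, value), ascending by seq

    def put(self, key, value):
        self.seq += 1
        self.versions.setdefault(key, []).append((self.seq, value))

    def snapshot(self):
        return self.seq     # O(1): a snapshot is just the current sequence number

    def get(self, key, snapshot=None):
        """Return the newest value with seq <= snapshot (or the latest if no snapshot)."""
        limit = self.seq if snapshot is None else snapshot
        for seq, value in reversed(self.versions.get(key, [])):
            if seq <= limit:
                return value
        return None
```

The sketch also shows the cost: the older version can only be discarded once no snapshot at or below its sequence number remains open.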

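Long-held snapshots are a classic RocksDB foot-gun: they prevent compaction from dropping superseded versions, so space amplification grows for as long as they stay open. RocksDB does not expire snapshots on its own, so release them promptly and enforce a maximum lifetime in application code.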

Merge Operators

A unique RocksDB feature: register a function that combines a base value with a sequence of "merge" writes. This lets you express counters, append-to-list, JSON-patch-like updates without a read-modify-write cycle. The merge operator runs lazily — at compaction or read time — when the engine encounters a stack of merges. Critical for high-throughput counter workloads where read-modify-write would serialize.
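
Lazy merging at read time can be sketched as follows. This is an illustration of the idea, not the MergeOperator interface: entries for one key are given newest-first as ("put", value) or ("merge", operand) pairs, and get_with_merge and add_counter are hypothetical names.

```python
def get_with_merge(entries, merge_fn):
    """Resolve one key's value from its newest-first history of puts and merges."""
    operands = []
    base = None
    for kind, payload in entries:
        if kind == "merge":
            operands.append(payload)  # collect merges until a base value is found
        else:
            base = payload            # a put hides everything older than it
            break
    for operand in reversed(operands):  # apply the oldest merge first
        base = merge_fn(base, operand)
    return base

def add_counter(existing, operand):
    """Merge function for a counter; a missing base value counts as zero."""
    return (existing or 0) + operand
```

The writer only ever appends an operand, so incrementing a counter never requires reading its current value first.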

Tradeoffs and When Not To Use RocksDB

RocksDB is hard to operate. Its 200+ tuning knobs hide a small number that genuinely matter behind a large number that look like they do. Compaction stalls, write stalls, and amplification surprises are real. The library has no replication, no SQL, and no transactions across processes. It ships backup primitives (BackupEngine, Checkpoint), but scheduling, shipping, and verifying backups is your job. It is a component.

Use it when you're building a database or a storage-heavy service and need the LSM tradeoff (high write throughput, sequential ingestion, datasets larger than RAM). Don't use it when you'd be better served by a real database (PostgreSQL, MySQL) with a proven operations story, or by a managed store (DynamoDB, Spanner) where you're paying for the operational complexity to be someone else's problem.

RocksDB vs Other LSM Engines

Feature | RocksDB | LevelDB | Pebble (Go) | BadgerDB (Go)
Origin | Facebook fork of LevelDB | Google (Jeff Dean, Sanjay Ghemawat) | Cockroach Labs | Dgraph Labs
Language | C++ | C++ | Go (CGO-free) | Go
Threading | Multi-threaded compaction | Single-threaded | Multi-threaded | Multi-threaded
Compaction strategies | Leveled, universal, FIFO | Leveled only | Leveled (RocksDB-compatible) | Leveled (key/value separation)
Column families | Yes | No | No (different model) | Streams API
Transactions | Pessimistic + optimistic | No | Snapshot batches | SSI transactions
Key/value separation | BlobDB (optional) | No | Built-in (per-key threshold) | WiscKey (default)
Used by | MyRocks, TiKV, Flink, Kafka Streams | Chrome (IndexedDB), Bitcoin Core | CockroachDB (default since v20.2) | Dgraph, Hypermode

FAQ

Why does RocksDB need so many compaction threads?

Compaction is essentially a background merge sort that reads and writes huge volumes of data. With leveled compaction at 10× ratios, sustained writes can generate tens of times their volume in compaction I/O. A single thread can't keep up: L0 files accumulate, write stalls trigger, and tail latency explodes. Setting max_background_jobs (which replaces the deprecated max_background_compactions) high enough for 4–8 concurrent compactions is a typical baseline for write-heavy workloads.

What is a "write stall" and how do I avoid it?

RocksDB throttles or stops writes when (a) the active MemTable and all allowed immutable MemTables are full and the flush thread can't keep up, (b) L0 has too many files (by default 20 files trigger a slowdown and 36 stop writes), or (c) the estimated bytes pending compaction pass soft_pending_compaction_bytes_limit (slowdown) or hard_pending_compaction_bytes_limit (stop). Avoid stalls by sizing max_write_buffer_number, level0_slowdown_writes_trigger, and level0_stop_writes_trigger generously, and by giving compaction enough threads.
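
Conditions (a) and (b) can be sketched as a small decision function. This is a simplified model, not RocksDB's actual stall logic: write_state is a hypothetical name, the thresholds are the defaults quoted above, and the MemTable rule is reduced to "the count of unflushed MemTables has reached max_write_buffer_number".

```python
def write_state(l0_files, unflushed_memtables,
                max_write_buffer_number=2,
                level0_slowdown_writes_trigger=20,
                level0_stop_writes_trigger=36):
    """Return "ok", "slowdown", or "stop" for the current LSM shape (simplified)."""
    if unflushed_memtables >= max_write_buffer_number:
        return "stop"        # (a) no free write buffer: flush is behind
    if l0_files >= level0_stop_writes_trigger:
        return "stop"        # (b) too many L0 files: compaction is far behind
    if l0_files >= level0_slowdown_writes_trigger:
        return "slowdown"    # (b) compaction is falling behind: throttle writers
    return "ok"
```

Raising the triggers buys headroom for bursts, but it does not fix a sustained write rate that compaction cannot match; it only delays the stall.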

How do I pick a compaction strategy?

Start with leveled compaction with level_compaction_dynamic_level_bytes=true. Switch to universal only if your workload is genuinely write-bound and you can tolerate ~2× space amplification during compactions. Use FIFO for time-series with hard retention. Almost everyone should start leveled.

Should I use BlobDB / key-value separation?

Yes, if your values are large (≥1 KB) and the workload is write-heavy. Storing values in separate blob files cuts compaction cost dramatically, because compaction now rewrites only keys and pointers, not the values themselves. The tradeoff is that each point lookup that hits now does an extra blob file read. For small values it's a loss; for large ones it's transformative.

How does TiKV / CockroachDB / MyRocks differ from raw RocksDB?

They use RocksDB (or Pebble) as the storage engine and add what a database needs on top: replication (Raft in TiKV and CockroachDB, MySQL replication in MyRocks) and some mix of SQL parsing, transactions across nodes, schema, and operations tooling. RocksDB itself has none of those. Saying "TiKV is RocksDB" is like saying "PostgreSQL is just a heap file": technically true and entirely missing the point.

Why did CockroachDB switch to Pebble?

RocksDB is C++; CockroachDB is Go. Calling RocksDB from Go via CGO has nontrivial overhead (GC interaction, stack management). Pebble is a Go-native reimplementation, RocksDB-format-compatible, that eliminates the CGO boundary. The same arguments don't apply to Java, C++, or Rust ecosystems where RocksDB remains dominant.

What's the relationship between RocksDB and the Tigerbeetle / FoundationDB / Spanner engines?

None directly. TigerBeetle uses its own purpose-built, LSM-based storage engine designed for financial workloads. FoundationDB's original SSD engine is built on SQLite's B-tree rather than an LSM tree. Spanner uses a custom Bigtable-derived storage layer over Colossus. Where the LSM idea is shared, the implementations have diverged radically.