RocksDB LSM Tree

RocksDB is built around the Log-Structured Merge tree — a write-optimized data structure that buffers writes in an in-memory memtable, flushes that memtable as a sorted run to L0 on disk, and progressively merges sorted runs into larger files at higher levels (L1, L2, …) as data accumulates. Reads check the memtable first, then each level in order. This page traces the write path from a single Put through the levels, explains why every LSM tree must navigate the three-way tradeoff between write, read, and space amplification, and shows the on-disk layout that makes a read possible at all.

Write Path Through the LSM

Put → memtable → flush → compaction up the levels Put(k, v) → WAL append + memtable memtable (skiplist) in-memory, sorted immutable flush worker takes it L0 SST file sorted, immutable background compaction merges level N into level N+1 L0 (4-12 files) L1 (~256 MB) L2 (~2.5 GB) L3 (~25 GB) each level holds ~10x the data of the level above Read path: memtable → L0 (every file) → L1 (1 file via fence keys) → L2 → … until found, or all checked

Key Numbers

Default memtable
64 MB
L0 trigger
4 files
Level multiplier
10x
Default SST
64 MB
Block size
4 KB
Levels typical
5-7
WAL sync
configurable

Why LSM Beats B-Tree for Writes

Sequential I/O wins
B+trees write to random pages; an SSD's FTL pays write amplification. LSM trees write big sequential files. SSDs love sequential writes — 10-100x more sustained throughput.
Batched compaction
Compaction is async. The hot path (Put) stays cheap: a memtable insert. The expensive merge happens later, in the background, when the disk is otherwise idle.
Tunable tradeoffs
LSM exposes write/read/space amp as configurable knobs. You can build a fully read-optimized LSM (low fanout, frequent compaction) or write-optimized (deep levels, leveled compaction off). B-trees give you one fixed shape.

The Write Path

From Put to durably-committed bytes.

db.Put(key, value);

1. WAL append: write the operation to the write-ahead log (sequential)
2. Memtable insert: add (key, value) to the active memtable (skiplist, in-memory)
3. (optional) WAL fsync if write_options.sync = true
4. Return success to caller

When memtable hits write_buffer_size (default 64 MB):
5. Memtable becomes immutable; a new memtable replaces it
6. Background flush worker writes immutable memtable as an L0 SST file
7. WAL entries covered by the flushed memtable can now be discarded

Notice that steps 1-2 are constant time per write. The memtable is a skiplist: O(log n) insert, ordered by key. The expensive parts (steps 5-7 and beyond) happen asynchronously. The write path latency is dominated by the WAL fsync; if sync=false the WAL goes to the page cache and you trust the kernel for durability.

SST Files: Sorted String Tables

The on-disk format. Immutable. Sorted. Indexed.

An SST file is a sequence of blocks: data blocks (key-value pairs), an index block (one entry per data block: largest key in that block → block offset), filter blocks (bloom filter), and a footer with metadata pointers. Default block size is 4 KB; SST file size is typically 64 MB (target_file_size_base).

SST file:
+--------------------+
| data block 0       |  ← KV pairs: keys lexicographically increasing
+--------------------+
| data block 1       |  ← compressed by default (Snappy/LZ4/ZSTD)
+--------------------+
| ... data blocks N  |
+--------------------+
| index block        |  ← largest_key_per_data_block → offset
+--------------------+
| filter block       |  ← bloom filter for fast negative lookups
+--------------------+
| footer (metadata)  |  ← magic, version, index offset, filter offset
+--------------------+

To find a key in an SST: read the footer (last 53 bytes), seek to the index block, binary search to find the data block whose range contains the key, read that block (one disk seek), decompress, binary search within the block. The bloom filter check happens before the data-block read — if the filter says "key not present," skip the read entirely.

L0: The Special Level

L0 files come straight from memtable flushes and overlap.

Each memtable flush produces one L0 SST. Two consecutive flushes can both contain the same key (if it was written, then written again). L0 is the only level where files have overlapping key ranges. As a consequence, finding a key in L0 requires checking every L0 file from newest to oldest.

RocksDB triggers L0→L1 compaction when L0 has level0_file_num_compaction_trigger files (default 4). When L0 grows to level0_slowdown_writes_trigger files (default 20), writes are throttled. At level0_stop_writes_trigger (default 36) writes block entirely until compaction catches up. This is the L0 stall — sustained write rate exceeding compaction capacity.

Levels L1+: The Sorted Run

Each level is one sorted run, partitioned across many files.

L1 and beyond are key-disjoint: the union of files at a level forms one sorted run, with each individual SST file holding a non-overlapping key range. The level's MANIFEST records each file's smallest_key and largest_key, ordered lexicographically.

To find a key in L1+: binary search the MANIFEST's sorted file list to find the one file whose [smallest, largest] range contains the key. At most one file per level. Read its bloom filter; if positive, read the data block. This makes L1+ point lookups O(log files-per-level) independent of the total number of keys per level.

This is what fence keys buy you: even with petabytes of data spread across thousands of SST files, a single point lookup at the deepest level is one binary search and one block read.

Compaction Picks Files to Merge

Leveled vs Universal vs FIFO compaction styles.

Leveled compaction (the default and most common): when L_N exceeds its target size, pick one file from L_N and all overlapping files from L_(N+1), merge into new L_(N+1) files. Bounded write amplification per byte (~10x for 10x fanout) but consistent space amplification.

Universal compaction: instead of strict per-level sizes, accumulate sorted runs and merge whenever the size ratio between adjacent runs exceeds a threshold. Lower write amplification (a write hits compaction fewer times) but worse space amplification — at the moment of major compaction, the input files plus output file are all on disk.

FIFO compaction: drop the oldest SST when total size exceeds a cap. Used for time-series data with retention. No merging at all; data simply ages out.

The Three Amplifications

The fundamental LSM tradeoff.

TypeWhat it measuresCaused byMitigated by
Write ampbytes written to disk per logical bytecompaction rewriting same data N timesfewer levels, smaller multiplier
Read ampblocks read per Getchecking multiple levelsbloom filters, fewer files per level
Space ampdisk used / logical dataobsolete versions, compaction pendingaggressive compaction, leveled style

You cannot minimize all three at once. Aggressive compaction reduces space and read amp but increases write amp. Deep levels reduce write amp but increase read amp. Tuning is workload- and storage-medium-specific.

FAQ

Why is the LSM tree better than a B+tree for writes?

B+trees update in place: every write triggers a random write to the leaf page on disk. LSM trees buffer writes in memory (the memtable) and flush sequentially. Then merging (compaction) is also sequential — writing big files start to finish. Sequential writes are 10-100x faster than random writes on both SSD (avoiding write amplification at the FTL) and HDD (avoiding seeks). The LSM tradeoff: you pay write amplification later in compaction, where it's batched and async, instead of synchronously per write.

Why are levels arranged as size 10x each?

max_bytes_for_level_multiplier defaults to 10: each level holds 10x the data of the level above. With L0 at ~256 MB, L1 is 256 MB, L2 is 2.5 GB, L3 25 GB, L4 250 GB. The 10x ratio is the LSM-tree paper's recommendation balancing write amplification (lower ratio = more frequent compactions per byte) and read amplification (higher ratio = fewer levels to check). For very write-heavy workloads, a smaller multiplier (5) reduces space amplification. For mostly-reads, larger ratios (20) reduce levels.

What's a sorted run?

An immutable file (or set of files) containing key-value pairs in sorted order with no duplicate keys among them. L0 is special — each L0 file is its own sorted run, and they may overlap with each other (because they came from independent flushes). L1+ has exactly one sorted run per level (or one sorted run subdivided into many SST files with disjoint key ranges, depending on compaction style). Reading a key requires checking the memtable, then each L0 file, then the relevant SST in L1, L2, …

What's the role of fence keys?

Each SST file's metadata includes its smallest and largest key. The level's manifest indexes all files by key range. To find a key in L4, the SST chooser does a binary search on fence keys to identify the one file (in L1+) that could contain it. Without fence keys, every read would scan every file in the level. They're the reason point lookups in LSM are O(log n × levels) instead of O(n).

What's the LSM tradeoff in one sentence?

LSM trees trade write amplification (compaction rewrites the same data multiple times across levels) for read amplification (you may need to check multiple levels to find a key) and space amplification (compaction pending data takes extra space). All three are tunable; you can't minimize all three at once.

Why is L0 special?

L0 files are emitted directly from memtable flush, so they may have overlapping key ranges with each other (memtable A is flushed, then memtable B is flushed independently — they could both contain the same key). To find a key in L0, RocksDB must check every L0 file. To prevent unbounded read amp, L0→L1 compaction is triggered when L0 has too many files (level0_file_num_compaction_trigger, default 4). After L1, files are key-disjoint within a level — only one file per level needs checking for any given key.