RocksDB LSM Tree
RocksDB is built around the Log-Structured Merge tree — a write-optimized data structure
that buffers writes in an in-memory memtable, flushes that memtable as a sorted
run to L0 on disk, and progressively merges sorted runs into larger files at
higher levels (L1, L2, …) as data accumulates. Reads check the
memtable first, then each level in order. This page traces the write path from a single
Put through the levels, explains why every LSM tree must navigate the
three-way tradeoff between write, read, and space amplification, and shows the on-disk
layout that makes a read possible at all.
Write Path Through the LSM
Key Numbers
Why LSM Beats B-Tree for Writes
The Write Path
From Put to durably-committed bytes.
db.Put(key, value); 1. WAL append: write the operation to the write-ahead log (sequential) 2. Memtable insert: add (key, value) to the active memtable (skiplist, in-memory) 3. (optional) WAL fsync if write_options.sync = true 4. Return success to caller When memtable hits write_buffer_size (default 64 MB): 5. Memtable becomes immutable; a new memtable replaces it 6. Background flush worker writes immutable memtable as an L0 SST file 7. WAL entries covered by the flushed memtable can now be discarded
Notice that steps 1-2 are constant time per write. The memtable is a skiplist: O(log n)
insert, ordered by key. The expensive parts (steps 5-7 and beyond) happen asynchronously.
The write path latency is dominated by the WAL fsync; if sync=false the WAL
goes to the page cache and you trust the kernel for durability.
SST Files: Sorted String Tables
The on-disk format. Immutable. Sorted. Indexed.
An SST file is a sequence of blocks: data blocks (key-value pairs), an index block (one
entry per data block: largest key in that block → block offset), filter blocks (bloom
filter), and a footer with metadata pointers. Default block size is 4 KB; SST file size is
typically 64 MB (target_file_size_base).
SST file: +--------------------+ | data block 0 | ← KV pairs: keys lexicographically increasing +--------------------+ | data block 1 | ← compressed by default (Snappy/LZ4/ZSTD) +--------------------+ | ... data blocks N | +--------------------+ | index block | ← largest_key_per_data_block → offset +--------------------+ | filter block | ← bloom filter for fast negative lookups +--------------------+ | footer (metadata) | ← magic, version, index offset, filter offset +--------------------+
To find a key in an SST: read the footer (last 53 bytes), seek to the index block, binary search to find the data block whose range contains the key, read that block (one disk seek), decompress, binary search within the block. The bloom filter check happens before the data-block read — if the filter says "key not present," skip the read entirely.
L0: The Special Level
L0 files come straight from memtable flushes and overlap.
Each memtable flush produces one L0 SST. Two consecutive flushes can both contain the same key (if it was written, then written again). L0 is the only level where files have overlapping key ranges. As a consequence, finding a key in L0 requires checking every L0 file from newest to oldest.
RocksDB triggers L0→L1 compaction when L0 has level0_file_num_compaction_trigger
files (default 4). When L0 grows to level0_slowdown_writes_trigger files
(default 20), writes are throttled. At level0_stop_writes_trigger (default 36)
writes block entirely until compaction catches up. This is the L0 stall
— sustained write rate exceeding compaction capacity.
Levels L1+: The Sorted Run
Each level is one sorted run, partitioned across many files.
L1 and beyond are key-disjoint: the union of files at a level forms one sorted run, with each individual SST file holding a non-overlapping key range. The level's MANIFEST records each file's smallest_key and largest_key, ordered lexicographically.
To find a key in L1+: binary search the MANIFEST's sorted file list to find the one file whose [smallest, largest] range contains the key. At most one file per level. Read its bloom filter; if positive, read the data block. This makes L1+ point lookups O(log files-per-level) independent of the total number of keys per level.
This is what fence keys buy you: even with petabytes of data spread across thousands of SST files, a single point lookup at the deepest level is one binary search and one block read.
Compaction Picks Files to Merge
Leveled vs Universal vs FIFO compaction styles.
Leveled compaction (the default and most common): when L_N exceeds its target size, pick one file from L_N and all overlapping files from L_(N+1), merge into new L_(N+1) files. Bounded write amplification per byte (~10x for 10x fanout) but consistent space amplification.
Universal compaction: instead of strict per-level sizes, accumulate sorted runs and merge whenever the size ratio between adjacent runs exceeds a threshold. Lower write amplification (a write hits compaction fewer times) but worse space amplification — at the moment of major compaction, the input files plus output file are all on disk.
FIFO compaction: drop the oldest SST when total size exceeds a cap. Used for time-series data with retention. No merging at all; data simply ages out.
The Three Amplifications
The fundamental LSM tradeoff.
| Type | What it measures | Caused by | Mitigated by |
|---|---|---|---|
| Write amp | bytes written to disk per logical byte | compaction rewriting same data N times | fewer levels, smaller multiplier |
| Read amp | blocks read per Get | checking multiple levels | bloom filters, fewer files per level |
| Space amp | disk used / logical data | obsolete versions, compaction pending | aggressive compaction, leveled style |
You cannot minimize all three at once. Aggressive compaction reduces space and read amp but increases write amp. Deep levels reduce write amp but increase read amp. Tuning is workload- and storage-medium-specific.
FAQ
Why is the LSM tree better than a B+tree for writes?
B+trees update in place: every write triggers a random write to the leaf page on disk. LSM trees buffer writes in memory (the memtable) and flush sequentially. Then merging (compaction) is also sequential — writing big files start to finish. Sequential writes are 10-100x faster than random writes on both SSD (avoiding write amplification at the FTL) and HDD (avoiding seeks). The LSM tradeoff: you pay write amplification later in compaction, where it's batched and async, instead of synchronously per write.
Why are levels arranged as size 10x each?
max_bytes_for_level_multiplier defaults to 10: each level holds 10x the data of the level above. With L0 at ~256 MB, L1 is 256 MB, L2 is 2.5 GB, L3 25 GB, L4 250 GB. The 10x ratio is the LSM-tree paper's recommendation balancing write amplification (lower ratio = more frequent compactions per byte) and read amplification (higher ratio = fewer levels to check). For very write-heavy workloads, a smaller multiplier (5) reduces space amplification. For mostly-reads, larger ratios (20) reduce levels.
What's a sorted run?
An immutable file (or set of files) containing key-value pairs in sorted order with no duplicate keys among them. L0 is special — each L0 file is its own sorted run, and they may overlap with each other (because they came from independent flushes). L1+ has exactly one sorted run per level (or one sorted run subdivided into many SST files with disjoint key ranges, depending on compaction style). Reading a key requires checking the memtable, then each L0 file, then the relevant SST in L1, L2, …
What's the role of fence keys?
Each SST file's metadata includes its smallest and largest key. The level's manifest indexes all files by key range. To find a key in L4, the SST chooser does a binary search on fence keys to identify the one file (in L1+) that could contain it. Without fence keys, every read would scan every file in the level. They're the reason point lookups in LSM are O(log n × levels) instead of O(n).
What's the LSM tradeoff in one sentence?
LSM trees trade write amplification (compaction rewrites the same data multiple times across levels) for read amplification (you may need to check multiple levels to find a key) and space amplification (compaction pending data takes extra space). All three are tunable; you can't minimize all three at once.
Why is L0 special?
L0 files are emitted directly from memtable flush, so they may have overlapping key ranges with each other (memtable A is flushed, then memtable B is flushed independently — they could both contain the same key). To find a key in L0, RocksDB must check every L0 file. To prevent unbounded read amp, L0→L1 compaction is triggered when L0 has too many files (level0_file_num_compaction_trigger, default 4). After L1, files are key-disjoint within a level — only one file per level needs checking for any given key.