RocksDB Amplification
An LSM tree pays for its sequential-write efficiency in three currencies: write amplification (the same bytes get rewritten as compaction merges levels), read amplification (a Get may need to check multiple levels), and space amplification (compaction-pending data and obsolete versions take extra disk). They're tunable and they trade off against each other — every workload's sweet spot is different. This page shows what each amp actually measures, how to read it from RocksDB's stats output, and the levers you have to push each one down at the cost of others.
Measuring Write Amplification
Read it directly from RocksDB's stats.
std::string output;
db->GetProperty("rocksdb.stats", &output);  // returns false if the property is not recognized
** Compaction Stats [default] **
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Wr(MB/s) Rd(MB/s) Comp(sec) Comp(cnt) Avg(sec)
L0 4/0 256MB 1.0 0.0 0.0 0.0 12.5 12.5 0.0 1.0 25.6 0.0 488 1024 0.5
L1 8/0 1024MB 1.0 34.5 11.5 23.0 34.5 0.0 0.0 3.0 52.1 52.1 681 256 2.7
L2 65/0 10240MB 0.9 98.7 33.2 65.5 98.7 0.0 0.0 10.5 61.0 61.0 1660 64 25.9
Sum 77/0 11520MB 133.2 44.7 88.5 145.7 12.5 0.0 ...
Int 0/0 0KB 0.0 0.0 0.0 0.0 0.0 0.0 ...
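SST writes (Sum row, Write): 145.7 GB = 12.5 GB flush + 133.2 GB compaction
WA from SST writes alone: 145.7 / 12.5 ≈ 11.7×
WA counting the user bytes once more (for example, a WAL of the same size): 158.2 / 12.5 ≈ 12.7×
The Sum row gives total bytes written to SST files, and that total already includes the memtable flushes. Wnew on L0 is the bytes from memtable flushes (logical writes); the rest is compaction. WA = Write(Sum) / Wnew(L0). In this example that is roughly 12×, depending on whether the write-ahead log is counted, which is well within the healthy range for leveled compaction.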
Per-level W-Amp shows where the cost lives. RocksDB attributes each compaction to its output level, so L2's 10.5× means that every byte compacted into L2 caused about 10.5 bytes to be written there, because each incoming file overlaps roughly ten files' worth of data already in L2. Deep levels contribute most of the WA budget; if your data is mostly in L2-L4, that's where tuning helps.
Measuring Read Amplification
Block reads per Get, observable via tickers.
// Enable statistics collection
options.statistics = rocksdb::CreateDBStatistics();
// After workload runs:
auto stats = options.statistics;
// Data-block lookups, whether served from the block cache or from disk
auto block_reads = stats->getTickerCount(rocksdb::BLOCK_CACHE_DATA_MISS) +
                   stats->getTickerCount(rocksdb::BLOCK_CACHE_DATA_HIT);
auto gets = stats->getTickerCount(rocksdb::NUMBER_KEYS_READ);
// Guard against dividing by zero before any key has been read
double read_amp = gets ? static_cast<double>(block_reads) / gets : 0.0;
A read amp near 1.0 means most Gets touch a single data block, that is, they find the key in the first file probed. 5-10 means many Gets traverse multiple levels, usually because the Bloom filters are not staying in cache or use too few bits per key. Above 10, the LSM is misconfigured.
The block cache hit rate matters more than file-level read amp. BLOCK_CACHE_HIT / (BLOCK_CACHE_HIT + BLOCK_CACHE_MISS) tells you how often a block was already in memory. With a well-sized block cache (8-32 GB for terabyte datasets), 90%+ hit rates are common, and the disk-level read amp barely matters.
Measuring Space Amplification
Live data vs total disk usage.
uint64_t live = 0, total = 0;
db->GetIntProperty("rocksdb.estimate-live-data-size", &live);
db->GetIntProperty("rocksdb.total-sst-files-size", &total);
// Guard against an empty DB, where the live-size estimate is 0
double space_amp = live ? static_cast<double>(total) / live : 0.0;
Steady state on leveled compaction: 1.1-1.4. The 0.1-0.4 overhead is mostly tombstones not yet compacted away and partial L0 files. During a major compaction, total briefly inflates because input files and output files coexist; once the compaction commits, the inputs are removed and the ratio settles back.
Universal compaction shows higher steady-state SA (1.5-2×) and dramatic spikes during major compactions. If you must guarantee tight space, leveled is the right choice; if you want lowest WA and can tolerate higher disk allocation, universal works.
Tuning Each Amp Down
Levers that move each axis, and what they cost on the others.
| Goal | Lever | What it does | What it costs |
|---|---|---|---|
| Lower WA | universal compaction | fewer rewrites of same byte | higher SA spikes, higher RA |
| Lower WA | larger memtable | fewer flushes, more L0 batching | recovery slower, RAM use higher |
| Lower RA | more bloom bits/key | fewer false positives | more RAM for filter cache |
| Lower RA | partitioned filters | only loaded blocks' filters cached | more files; some configs slower |
| Lower SA | smaller multiplier | tombstones reach deep faster | higher WA |
| Lower SA | more compaction threads | compaction keeps up | more CPU, contended I/O |
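As a rough guide to where these levers live in the C++ API, the fragment below touches one option per row. It is a sketch, not a recommended configuration: `make_tuned_options` is a hypothetical helper and every numeric value is a placeholder to be replaced by measurement on your own workload.

```cpp
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

rocksdb::Options make_tuned_options() {
    rocksdb::Options options;

    // Lower WA: switch to universal compaction (costs SA spikes and RA).
    // Left commented out because the level multiplier below applies to leveled.
    // options.compaction_style = rocksdb::kCompactionStyleUniversal;

    // Lower WA: larger memtable (costs RAM and recovery time). Placeholder size.
    options.write_buffer_size = 256ull << 20;  // 256 MB

    // Lower SA: smaller level size multiplier (costs WA). The default is 10.
    options.max_bytes_for_level_multiplier = 8;

    // Lower SA: more background flush/compaction jobs (costs CPU and I/O).
    options.max_background_jobs = 8;

    rocksdb::BlockBasedTableOptions table_options;
    // Lower RA: more Bloom bits per key (costs filter memory). Placeholder value.
    table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(14));
    // Lower RA: partitioned filters, which require the two-level index.
    table_options.index_type =
        rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;
    table_options.partition_filters = true;

    options.table_factory.reset(
        rocksdb::NewBlockBasedTableFactory(table_options));
    return options;
}
```

Changing one lever at a time and re-reading the stats described earlier is the safest way to see which axis actually moved.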
FAQ
How do I measure write amplification in production?
RocksDB prints a compaction summary through DB::GetProperty('rocksdb.stats'), and the per-column-family portion through 'rocksdb.cfstats'. For raw counters, enable statistics and compare the ticker 'rocksdb.bytes.written' (logical user writes) with the bytes written by flushes and compactions ('rocksdb.flush.write.bytes' plus 'rocksdb.compact.write.bytes'). That ratio is the end-to-end write amp; bytes read by compaction do not count toward it. The compaction stats output shows a per-level table: column 'W-Amp' is per-level write amp; column 'Comp(cnt)' is the number of compactions. For overall WA: (bytes_flushed + bytes_compacted) / bytes_user_written. Healthy WA on a leveled-compaction OLTP workload is 10-30; >50 means levels are too deep, files are too small, or compaction can't keep up.
Why is read amp surprisingly high on a fresh DB?
Right after lots of L0 flushes, you may have 4-12 L0 files all overlapping in key range. Every Get must check every L0 file's bloom filter and possibly its data blocks, so L0 alone can contribute a read amplification of 4-12. Once compaction moves the data into L1+, where files within a level have disjoint key ranges, a Get probes at most one file per level and read amp drops to roughly the number of levels. The bloom filter's false positive rate determines how many of those probes turn into block reads rather than filter rejections; raise the filter's bits per key for read-heavy workloads.
What does space amplification look like?
Space amp = (size on disk) / (logical data size). Theoretical minimum is 1.0. Leveled compaction in steady state runs ~1.1-1.4: tombstones for deletes haven't compacted yet, plus the size of the level being compacted. Universal compaction can hit 2.0-3.0 during major compactions when input + output coexist. FIFO is bounded by retention, not amp. Watch 'rocksdb.estimate-live-data-size' vs total SST size; the gap is your space amp.
Is write amplification bad for SSDs?
Yes — SSDs have a finite endurance budget (P/E cycles). A WA of 30 means each user-byte causes 30 bytes of NAND program/erase activity. On a 1 PBW (petabyte-written) endurance SSD, WA=30 means you can only write ~33 TB of user data before exhausting the warranty. For RocksDB on consumer SSDs, this matters; for enterprise NVMe with multi-PBW endurance, less so. Tuning to lower WA (see /rocksdb/tuning) extends drive life proportionally.
What's the bloom filter's contribution to read amp reduction?
A bloom filter with 10 bits/key has ~1% false positive rate. For a Get on a non-existent key (the common case for many workloads), the filter rejects on every level except the one or two where false positives occur. On a 6-level LSM, that turns 6 block reads into ~0.06 block reads on average — a 100x read amp reduction. For existing keys, the filter is correct on the level that holds the key but may produce false positives on higher levels; you still pay one block read per false positive level.
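The numbers above follow from two short formulas, sketched below with hypothetical function names. The false positive estimate uses the textbook approximation for an optimally configured Bloom filter, 0.6185 raised to the bits per key; this is an assumption for illustration, and RocksDB's actual filter implementation may land slightly above or below it.

```cpp
#include <cmath>

// Textbook false positive rate for a Bloom filter with the optimal number
// of hash functions: roughly 0.6185^(bits per key).
double bloom_fpr(double bits_per_key) {
    return std::pow(0.6185, bits_per_key);
}

// Expected wasted block reads for a Get on a key that does not exist:
// each level's filter passes the lookup through with probability fpr.
double expected_false_reads(int levels, double fpr) {
    return levels * fpr;
}
```

At 10 bits per key the approximation gives a rate a little under 1%, and `expected_false_reads(6, 0.01)` reproduces the 0.06 block reads quoted for a 6-level tree.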
Why does compaction sometimes increase write amp temporarily?
When a level is over its target size, RocksDB picks one input file and merges it with all overlapping files in the next level. If many files overlap, that one input file forces rewriting all of them, which is locally high WA. Averaged over many compactions, WA per byte stabilizes around log(total_size / memtable_size) × multiplier, with the logarithm taken to base multiplier: the log term approximates the number of levels, and each level costs roughly multiplier rewrites. Spikes happen during catch-up after a write burst.
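The rule of thumb can be evaluated directly. This is a sketch of that heuristic only, under the assumption that the logarithm is taken to the base of the level multiplier; the function names are hypothetical and the result is an upper-end estimate, not a prediction of measured WA.

```cpp
#include <cmath>

// Approximate number of levels needed to grow from one memtable's worth
// of data to the full dataset, multiplying by `multiplier` per level.
double estimated_levels(double total_bytes, double memtable_bytes,
                        double multiplier) {
    if (memtable_bytes <= 0.0 || multiplier <= 1.0) return 0.0;
    if (total_bytes <= memtable_bytes) return 0.0;
    return std::log(total_bytes / memtable_bytes) / std::log(multiplier);
}

// Heuristic WA: each level rewrites a byte roughly `multiplier` times.
double estimated_write_amp(double total_bytes, double memtable_bytes,
                           double multiplier) {
    return estimated_levels(total_bytes, memtable_bytes, multiplier) * multiplier;
}
```

For a dataset 1000 times the memtable size with a multiplier of 10, this yields 3 levels and an estimate of about 30, the top of the healthy range quoted earlier.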