RocksDB Amplification

An LSM tree pays for its sequential-write efficiency in three currencies: write amplification (the same bytes get rewritten as compaction merges levels), read amplification (a Get may need to check multiple levels), and space amplification (compaction-pending data and obsolete versions take extra disk). They're tunable and they trade off against each other — every workload's sweet spot is different. This page shows what each amp actually measures, how to read it from RocksDB's stats output, and the levers you have to push each one down at the cost of others.

The Three Amplifications Visualized

All three rise and fall together, a budget you allocate:

Write Amp: bytes written / user bytes
  flush: 1× (memtable → L0)
  L0→L1: 1× (one file each)
  L1→L2: ~10× (10 overlapping)
  L2→L3: ~10×
  total: O(log N × multiplier)
  healthy: 10-30 for OLTP
  bad: >50 means tune

Read Amp: block reads / Get
  memtable: 0 reads
  each L0 file: 1 filter check + 1 block read on hit
  L1+: 1 file per level
  bloom rejects most levels
  healthy: ~1-2 blocks/Get
  cache hit reduces further
  filter caching is critical

Space Amp: disk size / logical size
  tombstones not compacted
  stale versions in L0
  compaction inputs+outputs coexist briefly
  leveled: 1.1-1.4× steady
  universal: 2-3× during major
  deletes alone aren't enough

Key Numbers

Healthy WA: 10-30
Healthy RA: 1-2 blocks
Healthy SA: 1.1-1.4×
Bloom default: 10 bits/key
FP rate @10 bits: ~1%
Universal SA peak: 2-3×
FIFO SA: ~1.0× hard

The Pareto Frontier

Lower WA → higher RA/SA
Fewer compactions means tombstones and stale versions linger. More files per level means more bloom checks per Get. Universal compaction is the extreme: minimum WA, maximum RA/SA spikes.
Lower RA → higher WA
Aggressive compaction cleans levels fast (low RA, low SA) but rewrites every byte multiple times across levels (high WA). Good for read-heavy workloads on durable disks.
Lower SA → higher WA
Compaction reclaims space by merging away tombstones. Running compaction more aggressively (lower max_bytes_for_level_base, smaller files) means lower SA but more rewrites.

Measuring Write Amplification

Read it directly from RocksDB's stats.

std::string output;
db->GetProperty("rocksdb.stats", &output);

** Compaction Stats [default] **
Level    Files   Size    Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Wr(MB/s) Rd(MB/s) Comp(sec) Comp(cnt) Avg(sec)
  L0      4/0   256MB     1.0     0.0     0.0      0.0      12.5    12.5      0.0   1.0    25.6     0.0      488     1024     0.5
  L1      8/0  1024MB     1.0    34.5    11.5     23.0      34.5     0.0      0.0   3.0    52.1    52.1      681      256     2.7
  L2     65/0  10240MB    0.9    98.7    33.2     65.5      98.7     0.0      0.0  10.5    61.0    61.0     1660       64    25.9
  Sum   77/0  11520MB         133.2    44.7     88.5     145.7    12.5      0.0   ...
   Int   0/0      0KB         0.0     0.0      0.0       0.0     0.0      0.0   ...

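Cumulative writes: 12.5 GB WAL + 145.7 GB flush and compaction = 158.2 GB total
Per-byte WA: 158.2 / 12.5 = 12.66×

The Sum row gives total bytes written to SST files, flushes included. Wnew on L0 is the bytes from memtable flushes (logical writes); the rest of the Sum is compaction. WA for SST files alone is Write(Sum) / Wnew(L0) = 145.7 / 12.5 ≈ 11.7×. With the WAL enabled, each user byte is also written once to the log, which brings the total to ~12.7×, well within the healthy range for leveled compaction.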

Per-level W-Amp shows where the cost lives. RocksDB attributes each compaction to its output level, so L2's 10.5× describes compactions that write into L2: each byte pushed down from L1 caused about 10.5 bytes to be written, because it had to be merged with the ~10 overlapping L2 files. Deep levels contribute most of the WA budget; if your data is mostly in L2-L4, that's where tuning helps.

Measuring Read Amplification

Block reads per Get, observable via tickers.

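#include <rocksdb/statistics.h>

// Enable statistics collection before opening the DB
options.statistics = rocksdb::CreateDBStatistics();

// After workload runs:
auto stats = options.statistics;
// Data-block lookups, whether served from the block cache or from disk
uint64_t block_reads = stats->getTickerCount(rocksdb::BLOCK_CACHE_DATA_MISS) +
                       stats->getTickerCount(rocksdb::BLOCK_CACHE_DATA_HIT);
uint64_t gets = stats->getTickerCount(rocksdb::NUMBER_KEYS_READ);
// Guard against division by zero when no Gets have run yet
double read_amp = gets ? static_cast<double>(block_reads) / gets : 0.0;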

A read amp near 1.0 means most Gets find the key in the first level checked (or even the block cache). 5-10 means many Gets traverse multiple levels — usually because bloom filter cache is too small or too few bits per key. Above 10, the LSM is misconfigured.

The block cache hit rate matters more than file-level read amp. BLOCK_CACHE_HIT / (BLOCK_CACHE_HIT + BLOCK_CACHE_MISS) tells you how often a block was already in memory. With a well-sized block cache (8-32 GB for terabyte datasets), 90%+ hit rates are common, and the disk-level read amp barely matters.

Measuring Space Amplification

Live data vs total disk usage.

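uint64_t live = 0, total = 0;
db->GetIntProperty("rocksdb.estimate-live-data-size", &live);
db->GetIntProperty("rocksdb.total-sst-files-size", &total);

// Ratio of bytes on disk to bytes of live data; 0 if the DB is empty
double space_amp = live ? static_cast<double>(total) / live : 0.0;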

Steady state on leveled compaction: 1.1-1.4. The 0.1-0.4 overhead is mostly tombstones not yet compacted away and partial L0 files. During a major compaction, total briefly inflates because input files and output files coexist; once the compaction commits, the inputs are removed and the ratio settles back.

Universal compaction shows higher steady-state SA (1.5-2×) and dramatic spikes during major compactions. If you must guarantee tight space, leveled is the right choice; if you want lowest WA and can tolerate higher disk allocation, universal works.

Tuning Each Amp Down

Levers that move each axis, and what they cost on the others.

Goal     | Lever                   | What it does                        | What it costs
Lower WA | universal compaction    | fewer rewrites of same byte         | higher SA spikes, higher RA
Lower WA | larger memtable         | fewer flushes, more L0 batching     | recovery slower, RAM use higher
Lower RA | more bloom bits/key     | fewer false positives               | more RAM for filter cache
Lower RA | partitioned filters     | only loaded blocks' filters cached  | more files; some configs slower
Lower SA | smaller multiplier      | tombstones reach deep faster        | higher WA
Lower SA | more compaction threads | compaction keeps up                 | more CPU, contended I/O
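As a rough illustration of how the levers listed above map onto the C++ options API, the fragment below sets one option per lever. MakeTunedOptions is a hypothetical helper name, and every numeric value is a placeholder assumption rather than a recommendation. It needs the RocksDB headers and library to build.

```cpp
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

// Illustrative values only; every number here is an assumption to tune.
rocksdb::Options MakeTunedOptions() {
    rocksdb::Options options;

    // Lower WA: larger memtable, so fewer and bigger flushes
    // (costs RAM and slower recovery).
    options.write_buffer_size = 256 << 20;  // 256 MB

    // Lower SA: smaller level multiplier (costs WA).
    options.max_bytes_for_level_multiplier = 8;

    // Lower SA: more background jobs so compaction keeps up
    // (costs CPU and I/O bandwidth).
    options.max_background_jobs = 8;

    rocksdb::BlockBasedTableOptions table_options;
    // Lower RA: more bloom bits per key (costs filter memory).
    table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(12));
    // Lower RA: partitioned filters, which require the two-level index.
    table_options.partition_filters = true;
    table_options.index_type =
        rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;
    table_options.cache_index_and_filter_blocks = true;
    options.table_factory.reset(
        rocksdb::NewBlockBasedTableFactory(table_options));

    // Lower WA alternative: switch compaction style
    // (costs SA spikes and RA).
    // options.compaction_style = rocksdb::kCompactionStyleUniversal;

    return options;
}
```

Changing one lever at a time and re-reading the three ratios after each change keeps the cost of each trade visible.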

FAQ

How do I measure write amplification in production?

RocksDB exposes compaction statistics through DB::GetProperty("rocksdb.stats") and, per column family, through "rocksdb.cfstats". For an end-to-end ratio, compare the BYTES_WRITTEN ticker ("rocksdb.bytes.written", logical user writes) with the bytes written by flushes and compactions (the FLUSH_WRITE_BYTES and COMPACT_WRITE_BYTES tickers). The compaction stats output shows a per-level table: column "W-Amp" is per-level write amp; column "Comp(cnt)" is the number of compactions. Overall WA is (flush bytes + compaction bytes) / user bytes written. Healthy WA on a leveled-compaction OLTP workload is 10-30; >50 means levels are too deep, files are too small, or compaction can't keep up.

Why is read amp surprisingly high on a fresh DB?

Right after lots of L0 flushes, you may have 4-12 L0 files all overlapping in key range. Every Get must check every L0 file's bloom filter and possibly its data blocks, so L0 alone contributes a read amplification of 4-12. Once compaction moves that data into L1+, where files within a level have disjoint key ranges, read amp drops to roughly the number of levels (one file per level). The bloom filter false positive rate determines how many of those probes become block reads rather than filter rejections; for read-heavy workloads, raise the bits per key passed to NewBloomFilterPolicy.
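The file-count reasoning can be sketched as a small helper that takes the number of files at each level and returns an upper bound on how many SST files one Get may consult. The function name is hypothetical, and the per-level counts are assumed to come from something like the "rocksdb.num-files-at-level<N>" properties.

```cpp
#include <cstddef>
#include <vector>

// Upper bound on SST files a point lookup may consult.
// files_per_level[0] is L0, whose files overlap, so each one counts;
// deeper levels have disjoint files, so each non-empty level counts once.
int MaxFilesProbedPerGet(const std::vector<int>& files_per_level) {
    int probes = 0;
    for (std::size_t level = 0; level < files_per_level.size(); ++level) {
        const int files = files_per_level[level];
        if (files <= 0) continue;  // empty level, nothing to probe
        probes += (level == 0) ? files : 1;
    }
    return probes;
}
```

A fresh DB with 12 L0 files and nothing else gives 12, while a settled tree with 4 L0 files and two populated deeper levels gives 6. Most of those probes end at the bloom filter rather than at a block read.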

What does space amplification look like?

Space amp = (size on disk) / (logical data size). Theoretical minimum is 1.0. Leveled compaction in steady state runs ~1.1-1.4: tombstones for deletes haven't compacted yet, plus the size of the level being compacted. Universal compaction can hit 2.0-3.0 during major compactions when input + output coexist. FIFO is bounded by retention, not amp. Watch 'rocksdb.estimate-live-data-size' vs total SST size; the gap is your space amp.

Is write amplification bad for SSDs?

Yes. SSDs have a finite endurance budget (P/E cycles). A WA of 30 means each user byte causes 30 bytes of writes to the drive, before any further amplification inside the SSD itself. On a 1 PBW (petabytes written) endurance SSD, WA=30 means you can only write ~33 TB of user data before exhausting the warranty. For RocksDB on consumer SSDs, this matters; for enterprise NVMe with multi-PBW endurance, less so. Tuning to lower WA (see /rocksdb/tuning) extends drive life proportionally.

What's the bloom filter's contribution to read amp reduction?

A bloom filter with 10 bits/key has ~1% false positive rate. For a Get on a non-existent key (the common case for many workloads), the filter rejects at every level unless it returns a false positive, which happens about 1% of the time per level. On a 6-level LSM, that turns 6 block reads into ~0.06 block reads on average, a 100x read amp reduction. For existing keys, the filter correctly passes the lookup on the level that holds the key but may produce false positives on the levels checked before it; you still pay one block read per false-positive level.
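The numbers in this answer can be reproduced with a small sketch. It uses the textbook approximation for a bloom filter with an optimal number of hash functions, p ≈ 0.6185^(bits per key); RocksDB's own filter implementations may deviate somewhat from this figure. All function names are hypothetical.

```cpp
#include <cmath>

// Textbook false-positive estimate for a bloom filter that uses the
// optimal number of hash functions: p ~= 0.6185^(bits per key).
double BloomFalsePositiveRate(double bits_per_key) {
    if (bits_per_key <= 0.0) return 1.0;  // no filter: every probe passes
    return std::pow(0.6185, bits_per_key);
}

// Expected data-block reads for a Get on a key that is in none of the
// `levels` consulted: one read per false positive.
double ExpectedBlockReadsAbsentKey(int levels, double fp_rate) {
    return levels * fp_rate;
}

// Expected reads for a key stored in the level at position `hit_index`
// (0 = first level checked): one real read plus false positives on the
// levels checked before it.
double ExpectedBlockReadsPresentKey(int hit_index, double fp_rate) {
    return 1.0 + hit_index * fp_rate;
}
```

At 10 bits per key the approximation gives roughly 0.8%, in line with the ~1% quoted above, and six levels at a 1% rate give the 0.06 expected reads.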

Why does compaction sometimes increase write amp temporarily?

When a level is over its target size, RocksDB picks one input file and merges it with all overlapping files in the next level. If many files overlap, that one input file forces rewriting all of them, which shows up as locally high WA. Over time across many compactions, the average WA per byte stabilizes around log(total_size / memtable_size) × multiplier, with the logarithm taken in base multiplier (that is, the number of levels times the per-level rewrite factor). Spikes happen during catch-up after a write burst.
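The rule of thumb can be turned into a quick estimator. The sketch below takes the logarithm in base `multiplier`, which is an assumption about the intended base, and treats the result as a pessimistic estimate rather than a prediction. The function name is hypothetical.

```cpp
#include <cmath>

// Rough leveled-compaction write amp: levels x multiplier, where
// levels = log_multiplier(total_bytes / memtable_bytes).
// Returns 1.0 (flush only) when the data fits in a single memtable
// or the inputs are not meaningful.
double EstimateLeveledWriteAmp(double total_bytes, double memtable_bytes,
                               double multiplier) {
    if (memtable_bytes <= 0.0 || total_bytes <= memtable_bytes ||
        multiplier <= 1.0) {
        return 1.0;
    }
    const double levels =
        std::log(total_bytes / memtable_bytes) / std::log(multiplier);
    return levels * multiplier;
}
```

A dataset 1000 times the memtable size with a multiplier of 10 comes out at about 30. Measured values are often lower because not every compaction overlaps a full multiplier's worth of data.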