🗂️ LSM-Tree & Compaction
The Write-Optimized Storage Engine Behind TiKV
📝 The Level Structure
MemTable (RAM) — sorted in-memory buffer
L0 — overlapping SSTs (fast writes: each file is sorted, but key ranges overlap)
L1 — sorted, non-overlapping SSTs
L2 — 10x larger, sorted
💡 Why LSM-Trees?
Traditional B-trees update data in-place, requiring random disk I/O for every write. LSM-trees batch writes in memory, then flush sequentially — turning random writes into sequential writes. This makes writes 10-100x faster on SSDs and HDDs.
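The write path described above can be sketched in a few lines. This is a toy model, not TiKV's actual code: writes accumulate in an in-memory buffer and are later flushed as one sorted, sequential run (an SST in a real engine).

```python
# Minimal LSM write-path sketch (illustrative, not TiKV/RocksDB internals).
class MemTable:
    def __init__(self, flush_threshold=3):
        self.data = {}                      # key -> value, held in RAM
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        # Pure in-memory write: no disk I/O, overwrites are free.
        self.data[key] = value
        return len(self.data) >= self.flush_threshold

    def flush(self):
        # One sequential write of a sorted run -- this is the trick that
        # turns many random writes into a single sequential one.
        run = sorted(self.data.items())
        self.data = {}
        return run

mt = MemTable(flush_threshold=3)
sst = None
for k, v in [("b", 1), ("a", 2), ("c", 3)]:
    if mt.put(k, v):
        sst = mt.flush()   # sst is now [("a", 2), ("b", 1), ("c", 3)]
```

Note that the overwrite of a key before a flush costs nothing on disk; the B-tree equivalent would be a random page write each time.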
⚖️ The Write Amplification Tradeoff
The catch: data gets rewritten multiple times as it compacts through levels. A single key might be written to L0, then compacted to L1, then to L2... Write amplification is the ratio of bytes written to disk vs. bytes written by the app. Typical values: 10-30x for RocksDB.
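A back-of-envelope model makes the ratio concrete. The formula below is an illustrative upper bound for leveled compaction, not RocksDB's actual accounting: one write for the WAL, one for the memtable flush, and up to one fanout-sized rewrite per compacted level.

```python
def leveled_write_amp(compacted_levels, fanout=10):
    """Rough upper bound on write amplification for leveled compaction.

    Each app byte is written once to the WAL, once at flush, and may be
    rewritten up to `fanout` times as it merges into each deeper level.
    """
    return 2 + fanout * compacted_levels

print(leveled_write_amp(2))  # → 22, within the 10-30x range cited above
```

Real workloads land lower because not every byte reaches the deepest level, but the model shows why a larger level multiplier trades read cost for write cost.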
🔍 Read Amplification
To read a key, we might check: MemTable → L0 (multiple SSTs!) → L1 → L2. Bloom filters help skip SSTs that definitely don't contain the key. In the worst case, reads check every level — this is read amplification.
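The read path with Bloom filters can be sketched as follows. This is a toy model with hypothetical names (`BloomFilter`, `SST`, `read`), not RocksDB's API: each SST carries a filter, and a definite "no" lets the reader skip that file entirely.

```python
# Read-path sketch: MemTable -> L0 (every SST) -> deeper levels.
# A Bloom filter answers "definitely not here" or "maybe here".
import hashlib

class BloomFilter:
    def __init__(self, size=64, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all(self.bits & (1 << p) for p in self._positions(key))

class SST:
    def __init__(self, entries):
        self.entries = dict(entries)
        self.bloom = BloomFilter()
        for k in self.entries:
            self.bloom.add(k)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None                    # skipped: no "disk read" needed
        return self.entries.get(key)       # maybe-hit: must actually look

def read(key, memtable, levels):
    if key in memtable:                    # newest data wins
        return memtable[key]
    for level in levels:                   # L0 may hold several overlapping SSTs
        for sst in level:
            v = sst.get(key)
            if v is not None:
                return v
    return None

memtable = {"x": "mem"}
l0 = [SST([("a", "0a")]), SST([("b", "0b")])]
l1 = [SST([("c", "1c")])]
value = read("b", memtable, [l0, l1])      # found in an L0 SST
```

Bloom filters can produce false positives (a wasted lookup) but never false negatives, which is why they are safe to use as a skip test.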
🎛️ Tuning Knobs
MemTable Size — larger = fewer flushes, but more memory and longer recovery
  write-buffer-size = 128MB

L0 Trigger — compact to L1 when L0 has this many SSTs
  level0-file-num-compaction-trigger = 4

Level Ratio — each level is Nx larger than the previous
  max-bytes-for-level-multiplier = 10

Compression — balance CPU vs. disk space per level
  compression-per-level = [no, no, lz4, lz4, lz4, zstd, zstd]
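In TiKV these knobs live in the RocksDB sections of the config file. A sketch of how they might appear, assuming the `[rocksdb.defaultcf]` section of a standard tikv.toml (verify key names against your TiKV version):

```toml
# Hypothetical tikv.toml fragment collecting the knobs above.
[rocksdb.defaultcf]
write-buffer-size = "128MB"
level0-file-num-compaction-trigger = 4
max-bytes-for-level-multiplier = 10
compression-per-level = ["no", "no", "lz4", "lz4", "lz4", "zstd", "zstd"]
```

Cheap-or-no compression on the upper levels keeps flush and L0 compaction fast, while zstd on the bottom levels saves the most space, since that is where most data lives.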