RocksDB Tuning

RocksDB exposes hundreds of options and pretends each is independently tunable. In practice a handful of knobs determine whether your deployment behaves well: block cache size, memtable size and count, level configuration, compression strategy, and the rate limiter. This page walks through each in the order you'd actually adjust them, gives the production-typical values, and explains why the major distributed systems built on RocksDB (TiKV, MyRocks, Yugabyte) tune the way they do, and why CockroachDB moved to Pebble, a from-scratch reimplementation.

The Memory Hierarchy You're Tuning

Key Numbers (production-typical)

Block cache: 25-50% of RAM
Memtable: 128-512 MB
Max memtables: 4-8
SST file: 64-256 MB
L0 trigger: 4-8 files
Level multiplier: 10x
Compaction threads: 4-8

The Tuning Triangle

Read latency

Big block cache, generous bloom bits, partitioned filters, direct I/O, more compaction (smaller levels, fewer overlapping files). Costs CPU and RAM.

Write throughput

Big memtables, many memtables in pipeline, deep levels, large SST files, fewer compactions, no fsync. Costs RAM and crash-window data.

Disk longevity

Low write amp via universal compaction, large memtables, deep levels. Costs read amp and space amp. SSD-life-extending tuning often hurts read latency.

Block Cache: First and Most Important Knob

Where decompressed blocks live. Determines read latency.

BlockBasedTableOptions table_opts;
table_opts.block_cache = NewLRUCache(24ULL << 30);  // 24 GB (ULL keeps the shift 64-bit on every platform)
table_opts.cache_index_and_filter_blocks = true;
table_opts.pin_l0_filter_and_index_blocks_in_cache = true;
table_opts.cache_index_and_filter_blocks_with_high_priority = true;
options.table_factory.reset(NewBlockBasedTableFactory(table_opts));

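cache_index_and_filter_blocks stores each SST's index and Bloom filter blocks in the block cache. By default (false), they are loaded when the file is opened and held in heap memory outside the cache for as long as the table reader stays open, so their memory is not counted against the cache budget. Setting it to true makes that memory bounded and accounted for, but the blocks become evictable like any other. The "high priority" variant places them in the cache's high-priority pool so they are not evicted by streaming data block churn.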

pin_l0_filter_and_index_blocks_in_cache keeps L0 filter and index blocks pinned in the cache so they are never evicted. Since L0 is checked on every Get, this avoids latency spikes from reloading L0 filters.

Sizing: start at 25% of RAM. Compute the hit ratio from the BLOCK_CACHE_HIT and BLOCK_CACHE_MISS tickers. If it is below 90% on a read-heavy workload, increase the cache. Above 95%, returns diminish. Memtable + block cache + OS overhead should sum to ≤80% of RAM, leaving page cache headroom (or use direct I/O and claim more for the block cache).
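The sizing rule above can be checked with a few lines of arithmetic. The sketch below is a hypothetical helper, not part of RocksDB: `MemoryPlan` and `PlanMemory` are names invented for illustration, and the memtable and OS overhead figures are inputs you would supply yourself.

```cpp
#include <cstdint>

// Hypothetical helper (not a RocksDB API): derives a block cache size from a
// fraction of RAM and checks the "<= 80% of RAM" rule from the text.
struct MemoryPlan {
  std::uint64_t block_cache_bytes;
  std::uint64_t memtable_bytes;
  std::uint64_t os_overhead_bytes;
  bool fits;  // true if the three budgets together stay within 80% of RAM
};

MemoryPlan PlanMemory(std::uint64_t ram_bytes, double block_cache_fraction,
                      std::uint64_t memtable_bytes,
                      std::uint64_t os_overhead_bytes) {
  MemoryPlan plan;
  plan.block_cache_bytes = static_cast<std::uint64_t>(
      static_cast<double>(ram_bytes) * block_cache_fraction);
  plan.memtable_bytes = memtable_bytes;
  plan.os_overhead_bytes = os_overhead_bytes;
  const std::uint64_t total =
      plan.block_cache_bytes + memtable_bytes + os_overhead_bytes;
  plan.fits = total <= ram_bytes / 5 * 4;  // 80% of RAM, integer math
  return plan;
}
```

On a 64 GB host with a 2 GB memtable pool and 4 GB of assumed OS overhead, a 25% block cache fits comfortably, while a 75% block cache does not.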

Memtable and Write Buffers

Big memtables = fewer flushes = lower WA. Up to a point.

cf_opts.write_buffer_size = 256 << 20;        // 256 MB per memtable
cf_opts.max_write_buffer_number = 4;            // 4 memtables max
cf_opts.min_write_buffer_number_to_merge = 2;   // merge 2 before flush

// global RAM budget across all CFs
db_opts.write_buffer_manager.reset(new WriteBufferManager(2ULL << 30));  // 2 GB pool (2 << 30 overflows a 32-bit int)

Each CF's memtable holds writes until it fills write_buffer_size. Then it becomes immutable and a new memtable starts; the old one is flushed asynchronously to L0. Larger memtables mean (a) fewer flushes, (b) each flush produces a bigger L0 SST, (c) better duplicate-key elimination at flush time (writes to the same key are coalesced before disk).

max_write_buffer_number caps the total number of memtables per CF, counting the active one and those pending flush. Hit it and writes block. min_write_buffer_number_to_merge can be raised to 2 to coalesce two memtables in one flush, which reduces L0 file count.

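The WriteBufferManager caps total memtable memory across CFs (and across DB instances that share it). Without it, the worst case is the sum of write_buffer_size × max_write_buffer_number over every CF. With it, the CFs draw from one pool, and RocksDB triggers a flush when the pool fills. For deployments with many CFs, always use the manager.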
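To see why a shared budget matters with many CFs, compare the worst-case memtable footprint with and without a cap. This is a rough sketch: `CfMemtableConfig` and both functions are hypothetical names, and the capped figure is an approximation, since the manager's limit triggers flushes rather than acting as a hard ceiling.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical per-CF settings mirroring write_buffer_size and
// max_write_buffer_number from the snippet above.
struct CfMemtableConfig {
  std::uint64_t write_buffer_size;
  int max_write_buffer_number;
};

// Worst case without a shared budget: every CF fills every memtable.
std::uint64_t WorstCaseMemtableBytes(const std::vector<CfMemtableConfig>& cfs) {
  std::uint64_t total = 0;
  for (const CfMemtableConfig& cf : cfs) {
    total += cf.write_buffer_size *
             static_cast<std::uint64_t>(cf.max_write_buffer_number);
  }
  return total;
}

// Approximate footprint when a shared pool of manager_limit bytes applies.
std::uint64_t BoundedMemtableBytes(const std::vector<CfMemtableConfig>& cfs,
                                   std::uint64_t manager_limit) {
  return std::min(WorstCaseMemtableBytes(cfs), manager_limit);
}
```

Ten CFs at 256 MB × 4 memtables could reach 10 GB in the worst case; with a 2 GB pool the same configuration is held to roughly 2 GB.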

Levels and Multipliers

Where the LSM tradeoff is set.

cf_opts.compaction_style = kCompactionStyleLevel;
cf_opts.num_levels = 7;
cf_opts.max_bytes_for_level_base = 256 << 20;        // L1 = 256 MB
cf_opts.max_bytes_for_level_multiplier = 10;          // L2 = 2.5 GB, etc.
cf_opts.target_file_size_base = 64 << 20;             // 64 MB SSTs
cf_opts.target_file_size_multiplier = 1;              // same size all levels

cf_opts.level0_file_num_compaction_trigger = 4;
cf_opts.level0_slowdown_writes_trigger = 20;
cf_opts.level0_stop_writes_trigger = 36;

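The 10x multiplier is well-tested. Lowering it to 5x means each compaction into the next level rewrites less overlapping data, but the same dataset needs more levels, which raises read amp, and a larger share of the data sits above the bottom level, which raises space amp. Raising it to 20x needs fewer levels, at the cost of more overlapping data rewritten per compaction (higher per-level write amp).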
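The level targets implied by max_bytes_for_level_base and max_bytes_for_level_multiplier are simple to compute. The sketch below assumes static level sizing (each level is exactly the previous one times the multiplier) and an integer multiplier; `LevelTargets` and `BottomLevelFor` are hypothetical helper names, not RocksDB functions.

```cpp
#include <cstdint>
#include <vector>

// Target size in bytes for L1..L(num_levels-1); index 0 of the result is L1.
std::vector<std::uint64_t> LevelTargets(std::uint64_t level_base_bytes,
                                        std::uint64_t multiplier,
                                        int num_levels) {
  std::vector<std::uint64_t> targets;
  std::uint64_t target = level_base_bytes;
  for (int level = 1; level < num_levels; ++level) {
    targets.push_back(target);
    target *= multiplier;
  }
  return targets;
}

// Smallest level number whose target size is at least data_bytes.
// Returns -1 for inputs that would never terminate.
int BottomLevelFor(std::uint64_t data_bytes, std::uint64_t level_base_bytes,
                   std::uint64_t multiplier) {
  if (multiplier < 2 || level_base_bytes == 0) return -1;
  int level = 1;
  std::uint64_t target = level_base_bytes;
  while (target < data_bytes) {
    target *= multiplier;
    ++level;
  }
  return level;
}
```

Under this model, a 1 TB dataset with a 256 MB base bottoms out at L5 with a 10x multiplier, L7 with 5x, and L4 with 20x.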

L0 stall triggers protect against cascade failure: if compaction can't keep up, slowdown kicks in at 20 L0 files and stops writes at 36. If you see writes stalling at L0 stop in production, your compaction can't keep up — increase parallel compaction threads or move to faster disk.
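The two L0 thresholds amount to a three-state check on the L0 file count. The following is a simplified model with hypothetical names (`WriteState`, `L0WriteState`); real RocksDB also stalls on other conditions, such as too many memtables or too many pending compaction bytes, which this sketch ignores.

```cpp
// Simplified model of the L0-based write stall states described above.
enum class WriteState { kNormal, kSlowdown, kStopped };

// Defaults mirror level0_slowdown_writes_trigger = 20 and
// level0_stop_writes_trigger = 36 from the snippet above.
WriteState L0WriteState(int l0_file_count, int slowdown_trigger = 20,
                        int stop_trigger = 36) {
  if (l0_file_count >= stop_trigger) return WriteState::kStopped;
  if (l0_file_count >= slowdown_trigger) return WriteState::kSlowdown;
  return WriteState::kNormal;
}
```

A monitoring check built on this idea would alert well before the stop threshold, for example as soon as the slowdown state persists for more than a few seconds.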

Compaction Threads and Rate Limiter

Throughput on the maintenance path; impact on the foreground.

db_opts.max_background_jobs = 8;        // total bg threads (flush + compact)
db_opts.max_subcompactions = 4;          // intra-compaction parallelism

// Rate limit compaction I/O so it doesn't starve user reads
db_opts.rate_limiter.reset(NewGenericRateLimiter(
  80L * 1024 * 1024,    // rate_bytes_per_sec: 80 MB/s sustained
  100 * 1000,           // refill_period_us: 100 ms
  10                    // fairness: low-priority requests get a 1-in-10 chance to go first
));

max_background_jobs is the total compaction+flush thread count. Set to core_count for compaction-heavy workloads. max_subcompactions lets a single compaction job split into chunks (parallel writers within one compaction). Useful for large compactions on multi-core hosts.

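The rate limiter caps the write bandwidth of flushes and compactions. Without it, a major compaction can saturate the disk and your p99 read latency explodes. Set it to ~80% of the disk's sustained bandwidth. The third argument is a fairness setting, not a burst allowance: it keeps compaction (low priority) from being starved when flush (high priority) requests arrive continuously.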
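The two numeric inputs to the limiter follow from the disk's measured bandwidth. This sketch uses hypothetical helper names and assumes you have measured the disk's sustained throughput yourself; it only does the arithmetic and does not call RocksDB.

```cpp
#include <cstdint>

// Rate to hand to the limiter: a percentage of the disk's sustained bandwidth.
std::uint64_t RateLimitBytesPerSec(std::uint64_t disk_bytes_per_sec,
                                   std::uint64_t percent) {
  return disk_bytes_per_sec * percent / 100;
}

// Bytes the limiter can grant in one refill period at the given rate.
// A shorter period smooths I/O; a longer one allows larger bursts.
std::uint64_t BytesPerRefill(std::uint64_t rate_bytes_per_sec,
                             std::uint64_t refill_period_us) {
  return rate_bytes_per_sec * refill_period_us / 1000000;
}
```

For a disk that sustains 100 MB/s, 80% gives the 80 MB/s used above, and a 100 ms refill period works out to 8 MB granted per interval.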

Compression Strategy

Different levels, different tradeoffs.

cf_opts.compression_per_level = {
  kNoCompression,        // L0: latency-sensitive
  kLZ4Compression,       // L1: balanced
  kLZ4Compression,       // L2
  kZSTD,                 // L3+: storage density
  kZSTD,
  kZSTD,
  kZSTD
};

cf_opts.bottommost_compression = kZSTD;
cf_opts.bottommost_compression_opts.level = 6;       // higher ZSTD level for biggest level
cf_opts.bottommost_compression_opts.enabled = true;  // required, or the options above are ignored

L0 is the hot tier: every Get checks all L0 files, and the data is recently written. Skip compression there — saves CPU on the hot path, costs little disk because L0 is small. L1-L2 use LZ4 (fast decompression, modest compression). L3+ use ZSTD; the deepest level uses ZSTD level 6 for maximum density since data here rarely gets read.

For datasets that compress well (JSON, text, low-cardinality columns), this strategy can cut disk usage 3-5x with minimal latency impact. For binary or already-compressed data, compression is a small loss — consider kNoCompression everywhere if you've verified your data doesn't compress.
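The per-level scheme described above generalizes to any number of levels. The sketch below uses a stand-in enum (`Codec`) rather than RocksDB's own CompressionType so that it compiles without the library; mapping the result onto compression_per_level is left as an assumption about how you would wire it in.

```cpp
#include <vector>

// Stand-in for RocksDB's compression enum, for illustration only.
enum class Codec { kNone, kLZ4, kZSTD };

// L0 uncompressed, L1-L2 with LZ4, L3 and deeper with ZSTD.
std::vector<Codec> CompressionPlan(int num_levels) {
  std::vector<Codec> plan;
  for (int level = 0; level < num_levels; ++level) {
    if (level == 0) {
      plan.push_back(Codec::kNone);
    } else if (level <= 2) {
      plan.push_back(Codec::kLZ4);
    } else {
      plan.push_back(Codec::kZSTD);
    }
  }
  return plan;
}
```

Generating the list from num_levels avoids the common mistake of changing the level count and leaving a stale, shorter compression list behind.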

Why TiKV Uses RocksDB and CockroachDB Switched to Pebble

Forks and rewrites exist because the same kernel doesn't fit every host language.

TiKV (the Rust-based KV layer of TiDB) calls into RocksDB via Rust FFI. The C++/Rust boundary is well-tuned and FFI overhead is minimal. TiKV implements Percolator-style transactions itself on top of RocksDB, using three CFs (default, lock, write), with aggressive tuning for write-heavy OLTP workloads. They stay on the RocksDB codebase rather than rewriting the engine.

CockroachDB originally used RocksDB via Cgo, but Cgo's call overhead (~100ns per crossing) hurt them on hot paths. In 2018 they began Pebble — a from-scratch Go reimplementation of RocksDB's LSM. Pebble is now their default. The advantages: zero Cgo overhead, better integration with Go's runtime (concurrency, profiling, GC), and the freedom to add Cockroach-specific features (their MVCC layer, SSTable format extensions). The disadvantage: parallel maintenance burden — RocksDB and Pebble are both moving targets, and porting features between them is real work.

YugabyteDB uses RocksDB from C++; Cassandra uses its own SSTable code (not RocksDB); Kafka Streams uses RocksDB for its local state stores. The pattern: RocksDB is the LSM kernel of choice for non-Go systems; Go projects increasingly use Pebble.

FAQ

How big should the block cache be?

Rule of thumb: 25-50% of available RAM, leaving the rest for OS page cache and memtables. The block cache holds decompressed data blocks, index blocks, and filter blocks (when cache_index_and_filter_blocks=true). For workloads where reads dominate, lean toward 50%; for write-heavy workloads, 25% is enough since memtables and compaction will use the rest. Watch the BLOCK_CACHE_HIT/MISS ratio — aim for >90% on read-heavy workloads.
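The hit ratio is derived from two counters rather than read directly. A minimal sketch, assuming you have already fetched the hit and miss counts (in RocksDB these would typically come from the statistics object's ticker counts for BLOCK_CACHE_HIT and BLOCK_CACHE_MISS); `BlockCacheHitRatio` is a hypothetical helper name.

```cpp
#include <cstdint>

// Fraction of block cache lookups that were hits, in [0, 1].
// Returns 0.0 when there have been no lookups yet.
double BlockCacheHitRatio(std::uint64_t hits, std::uint64_t misses) {
  const std::uint64_t total = hits + misses;
  if (total == 0) return 0.0;
  return static_cast<double>(hits) / static_cast<double>(total);
}
```

Because the tickers are cumulative, sampling them periodically and computing the ratio over the difference between samples gives a more useful signal than the lifetime average.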

Should I use direct I/O or buffered I/O?

Direct I/O (use_direct_reads=true, use_direct_io_for_flush_and_compaction=true) bypasses the OS page cache, eliminating double-caching since RocksDB has its own block cache. Recommended for production. The downside: you lose the kernel's read-ahead heuristics and the page cache as a backup buffer. For most setups with proper block cache sizing, direct I/O is a net win — especially on systems with many other databases sharing the host.

What's the rate_limiter for?

Caps compaction and flush I/O bandwidth so they don't starve foreground reads/writes. Without it, a major compaction can saturate the disk and tank user latency. Set it to 80-90% of disk's sustained bandwidth: NewGenericRateLimiter(80 * 1024 * 1024) for 80 MB/s. The rate limiter also decouples compaction throughput from disk's IOPS budget — important on shared cloud disks where neighbor noise affects throughput.

Why does CockroachDB use Pebble instead of RocksDB?

Pebble is a from-scratch Go reimplementation of RocksDB's LSM internals, written by Cockroach Labs starting in 2018. Reasons: (1) Cgo overhead: every call into C++ from Go costs ~100ns and doesn't optimize well in Go's runtime. Pebble eliminates this for Go-native users. (2) Fine-grained control: being in-process Go lets them integrate observability and concurrency primitives more tightly. (3) Targeted features: Pebble adds things like value-blocks (BlobDB-like), better range deletes, and Cockroach-specific MVCC awareness. RocksDB remains the choice for C++ projects and for TiKV, where the Rust FFI bindings work fine.

How do I tune for low write amplification?

(1) Increase memtable size — write_buffer_size 256MB-1GB delays flushes, batching more writes. (2) Reduce levels — fewer levels = fewer rewrites; on smaller datasets, 4 levels can suffice. (3) Switch to universal compaction if space amp permits. (4) Increase target_file_size_base — bigger SST files mean fewer compactions. Tradeoffs: bigger memtables risk more data loss on crash if no WAL sync; fewer levels increase read amp; bigger files slow individual compactions.

Should I compress everything?

No. Compression on L0 fights the tight latency budget there, so leave L0 as kNoCompression. For L1-L2, LZ4 (or Snappy) is fast and cheap; from L3 down, ZSTD compresses noticeably better at higher CPU cost. Use compression_per_level=[None, LZ4, LZ4, ZSTD, ZSTD, ZSTD, ZSTD] as a sane default. Workloads with very compressible data (text, JSON) benefit most; binary data or already-compressed blobs barely compress.