RocksDB Bloom Filters
A bloom filter is a probabilistic data structure that answers "does this set contain X?" with
either "definitely no" or "probably yes" — never a false negative. RocksDB attaches a bloom
filter to every SST file so that Get can skip reading data blocks whenever the
filter rejects. The default 10 bits/key gives ~1% false positive rate, turning a 6-level
LSM read amplification of 6 block reads into roughly 0.06 expected block reads on the
common-case "key not present at this level." This page walks the math, the variants
(full, partitioned, ribbon, prefix), and the configuration knobs that decide how much RAM
and CPU you spend for how few false positives.
Bloom Filter Lookup
Key Numbers
Why Bloom Filters Are the Single Biggest LSM Read Win
The Math: Bits Per Key vs FP Rate
More bits = exponentially fewer false positives.
FP_rate ≈ (1 - e^(-k·n/m))^k where: n = number of keys in the filter m = total bits k = number of hash functions optimal k = (m/n) · ln(2)
For RocksDB's default 10 bits/key, optimal k = 10·ln(2) ≈ 6.93, which it rounds to 7. The false positive rate at this point is the well-known table:
bits/key FP rate
8 2.16%
10 0.82% ← RocksDB default
12 0.31%
14 0.12%
16 0.046%
20 0.0067% Each additional 4 bits/key drops FP by an order of magnitude. Practical sweet spot: 10-14 bits depending on workload. For mostly-misses workloads (caching layer in front of a slow backend), 14-16 bits earns its memory back many times over in avoided block reads.
Full Filters: One Block Per SST
The classic and default form.
A full filter is a single bit array covering all keys in the SST. It lives in the SST's filter block, accessed via the footer metadata. Lookup procedure:
1. read SST footer (last 53 bytes) — typically cached 2. seek to filter block offset 3. read filter block — typically cached after first hit 4. compute 7 hash positions of the queried key 5. check all 7 bits in the filter
The filter block is loaded as one unit. For an SST containing 1M keys at 10 bits/key, the
filter is 1.25 MB. Cached, this is fast; uncached, it's a ~1ms disk read on rotational media
or ~50µs on NVMe. Hence the importance of cache_index_and_filter_blocks=true:
put filter blocks in the block cache so they stay hot.
Partitioned Filters: Index-Aligned Sub-Filters
Split the filter so only the relevant partition is loaded per query.
For a large SST (multi-GB), the full filter can be 10s of MB. Loading all of it for a single Get wastes cache. Partitioned filters split the filter into many small partitions, each covering a key-range slice that aligns with the SST's index structure.
SST file: +----------------+ | data blocks | +----------------+ | index L1 | ← top-level index +----------------+ | index L2 (×N) | ← per-partition indexes +----------------+ | filter L1 | ← top-level filter index (small) +----------------+ | filter L2 (×N) | ← per-partition filters (load only what's needed) +----------------+ | footer | +----------------+
To Get a key from a partitioned-filter SST: read the top-level index to find which partition could contain the key, then read just that partition's filter block. RAM use per Get drops from full-filter-size to one-partition-size — typically 100-1000x. Recommended for large SSTs (target_file_size_base ≥ 256 MB) or big LSMs where filter cache pressure matters.
Enable with BlockBasedTableOptions::partition_filters = true and
BlockBasedTableOptions::index_type = kTwoLevelIndexSearch.
Ribbon Filters (RocksDB 6.20+)
Same FP rate, ~30% fewer bits, slower to construct.
Ribbon filters solve a sparse linear system over GF(2) at construction time — for each key, a row of the system constrains certain output bits to specific values. Lookup checks whether the queried key's row is satisfied by the stored solution. The math is described in the 2021 paper "Ribbon Filter: Practically Smaller than Bloom and Xor."
Practical numbers from RocksDB benchmarks:
type bits/key FP rate build time query time bloom 10 0.82% 1.0× 1.0× ribbon 7 0.82% 2.0× 0.7× ribbon 8 0.27% 2.0× 0.7×
Ribbon shines when SSTs are queried much more than they're built — typical for large
static datasets. For write-heavy workloads where SSTs are short-lived and constantly
compacted, the 2x build cost matters more. RocksDB lets you choose per CF:
NewRibbonFilterPolicy(bits_per_key).
Prefix Bloom Filters
For workloads dominated by prefix scans.
A standard bloom hashes the full key. A prefix bloom hashes only a transform of the key — for
example, the first N bytes, or everything before the first ':' separator. Set
prefix_extractor on the column family options. Now a Seek("user_42:") to start
a prefix iteration can use the bloom to skip SSTs that don't contain any user_42: keys.
cf_options.prefix_extractor.reset(NewCappedPrefixTransform(8));
// keys are like:
"user_42:profile" → prefix "user_42:" → hashed
"user_42:feed" → prefix "user_42:" → same hash (already in bloom)
"user_99:profile" → prefix "user_99:" → different hash
// Seek("user_42:") can ask the bloom: "is prefix user_42: in this SST?"
// only loads SSTs whose bloom says yes Prefix bloom is a strict win for prefix-iterator workloads (one bloom hit per prefix, not per key). For point Get on full keys, you can still benefit if your queries happen to share a common prefix structure. The cost: more keys hash to the same bloom slot if many keys share a prefix, slightly raising FP rate.
FAQ
What's the actual false positive rate at 10 bits/key?
About 1% for a standard bloom filter. The math is FP_rate ≈ (1 - e^(-k·n/m))^k where m=bits, n=keys, k=hash count. At 10 bits/key with k=7 (RocksDB default), FP rate is ~0.82%. Doubling to 20 bits/key drops it to ~0.0014%. The marginal cost of more bits is linear in RAM; marginal gain in FP rate is exponential — so going from 10 to 14 bits is often a great trade. Going beyond 20 bits is rarely worth it; you're already three nines down.
What does a partitioned filter buy me?
Memory efficiency. Without partitioning, a single SST's bloom filter is one big block — to use it, you load the whole filter (potentially several MB for a large SST) into the block cache. Partitioned filters split the filter into many small blocks aligned with the SST's index; only the partition needed for the queried range is loaded. For a Get on a large LSM, partitioned filters give the same FP rate while loading 10-100x less RAM per query. Set partition_filters=true and cache_index_and_filter_blocks=true.
What are ribbon filters?
A 2021 alternative to bloom that achieves the same FP rate with ~30% fewer bits. Ribbon filters solve a sparse linear system at construction time to encode the membership; lookup is faster too because there are fewer hash functions. The cost: construction is heavier (build time per SST is ~2x bloom). For static SSTs that get queried often, the trade is great; for write-heavy workloads where SSTs are short-lived, the construction cost matters. Available since RocksDB 6.20 — set NewRibbonFilterPolicy() instead of NewBloomFilterPolicy().
When is a prefix bloom useful?
When your queries are prefix-iterators rather than point Gets. A normal bloom hashes the full key; a Seek('user_42:') benefits nothing because no full key 'user_42:' exists in the SST. A prefix bloom hashes a transform of the key (e.g. up to the first colon), letting prefix scans skip whole SSTs that have no matching prefix. Set prefix_extractor on the column family options. Common with key designs where many keys share a meaningful prefix (user_id, partition, tenant).
Should I cache filter blocks separately from data blocks?
Yes — cache_index_and_filter_blocks=true puts them in the block cache. The default has changed over time; in modern RocksDB it's recommended on. The reason: filters are small and frequently accessed. A filter block read from disk on every Get adds significant latency. Putting filters in the block cache makes them effectively memory-resident for hot SSTs. You may want pin_l0_filter_and_index_blocks_in_cache=true to never evict L0 filters since L0 is checked on every Get.
Why do filters live in SST footer metadata?
An SST file is a self-contained unit of persistent storage. The filter is built when the SST is written (during memtable flush or compaction output) and embedded in the file alongside the data and index. This makes the SST atomic: open the file, read the footer, find offsets to data, index, and filter blocks. There's no separate filter-store to keep in sync. When the SST is deleted (post-compaction), its filter goes with it — no orphaned state.