RocksDB Column Families
A Column Family (CF) in RocksDB is an independently configured key-value namespace within a single database. Each CF has its own memtables, SST files, bloom filter settings, compaction style, and tuning options, but all CFs share one WAL, so a write that spans multiple CFs commits atomically with a single fsync. This makes CFs the natural way to separate hot data from cold, transactional state from indexes, or short-TTL streams from long-lived blobs while preserving cross-CF transactional semantics. This page walks through the architecture, the knobs, and the patterns that real users (TiKV, MyRocks) build on top.
Column Family Architecture
Key Numbers
Why Column Families Exist
Creating and Using Column Families
The C++ API; bindings in Rust, Java, Go follow the same shape.
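// Open with an explicit CF list (every existing CF must be listed, including the default one)
std::vector<ColumnFamilyDescriptor> cfs;
cfs.push_back(ColumnFamilyDescriptor(kDefaultColumnFamilyName, default_opts));
cfs.push_back(ColumnFamilyDescriptor("indexes", index_opts));
cfs.push_back(ColumnFamilyDescriptor("events", events_opts));
std::vector<ColumnFamilyHandle*> handles;
DB* db = nullptr;
Status s = DB::Open(db_opts, "/path/to/db", cfs, &handles, &db);
assert(s.ok()); // on success, handles[i] corresponds to cfs[i]
// Write to a specific CF
s = db->Put(WriteOptions(), handles[1], key, value);
// Atomic write across CFs
WriteBatch batch;
batch.Put(handles[0], "user_42", user_data);
batch.Put(handles[1], "name_alice", "user_42");
s = db->Write(WriteOptions(), &batch); // both applied atomically via the WAL
// Before closing: release every handle, then the DB
for (auto* h : handles) db->DestroyColumnFamilyHandle(h);
delete db;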
Each handle identifies its CF by a numeric ID (ColumnFamilyHandle::GetID()); the WAL records
every write with that CF ID so that on recovery, replaying entries dispatches them to the
correct CF. Get, Delete, and NewIterator all take the CF handle as the argument right after
the read or write options.
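The dispatch-by-CF-ID step described above can be sketched with a toy model. This is not RocksDB code: WalRecord, Memtables, and ReplayWal are hypothetical names, the memtable is a plain std::map, and the sketch ignores details of real recovery such as skipping records that were already flushed.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy model (not RocksDB code): one logical WAL entry.
struct WalRecord {
  uint32_t cf_id;     // column family the write belongs to
  uint64_t sequence;  // global sequence number
  std::string key;
  std::string value;
};

// One in-memory table per column family, keyed by CF ID.
using Memtable = std::map<std::string, std::string>;
using Memtables = std::map<uint32_t, Memtable>;

// Replay the log in order, routing each record to the memtable of its CF.
// Returns the highest sequence number seen (0 for an empty log).
uint64_t ReplayWal(const std::vector<WalRecord>& wal, Memtables& memtables) {
  uint64_t last_sequence = 0;
  for (const WalRecord& rec : wal) {
    assert(rec.sequence > last_sequence);  // log order matches sequence order
    memtables[rec.cf_id][rec.key] = rec.value;
    last_sequence = rec.sequence;
  }
  return last_sequence;
}
```

Because every record carries its CF ID, a single pass over one shared log is enough to rebuild all memtables; no per-CF log is needed.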
The Shared WAL
One log, many memtables.
Every Put/Delete is appended to the WAL with a (CF_id, key, value, sequence_number) tuple.
Sequence numbers are monotonic across all CFs — there is one global sequence space.
A WriteBatch is assigned a contiguous range of sequence numbers and is written to the WAL as a single record, so on recovery it is applied entirely or not at all.
A consequence: WAL retention is bounded by the slowest-flushing CF. A WAL file can be
deleted only after every CF with data in that file has flushed it to an SST.
If CF A flushes every 10s but CF B every 10min, the WAL holds 10min of data, sized to B's
cadence. To bound WAL size, set max_total_wal_size. Its default of 0 does not mean
unlimited: RocksDB then derives the limit as 4x the sum of
write_buffer_size * max_write_buffer_number over all CFs. When the limit is
exceeded, RocksDB triggers flushes on whichever CFs are holding the oldest WAL entries.
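To make the retention rule concrete, here is a minimal sketch that tracks, per CF, the oldest WAL file still holding unflushed data. CfState, MinWalToKeep, and CfPinningOldestWal are hypothetical names for illustration; RocksDB's internal bookkeeping is more involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy model (not RocksDB code): flush state of one column family.
struct CfState {
  std::string name;
  bool has_unflushed_data;
  uint64_t oldest_unflushed_wal;  // meaningful only if has_unflushed_data
};

// Smallest WAL file number that must be kept; every file with a smaller
// number is obsolete. Returns current_wal if no CF has unflushed data.
uint64_t MinWalToKeep(const std::vector<CfState>& cfs, uint64_t current_wal) {
  uint64_t min_wal = current_wal;
  for (const CfState& cf : cfs) {
    if (cf.has_unflushed_data) {
      assert(cf.oldest_unflushed_wal <= current_wal);
      min_wal = std::min(min_wal, cf.oldest_unflushed_wal);
    }
  }
  return min_wal;
}

// The CF to flush first when the WAL is over budget: the one pinning the
// oldest WAL file. Returns nullptr if nothing is pinned.
const CfState* CfPinningOldestWal(const std::vector<CfState>& cfs) {
  const CfState* oldest = nullptr;
  for (const CfState& cf : cfs) {
    if (!cf.has_unflushed_data) continue;
    if (oldest == nullptr ||
        cf.oldest_unflushed_wal < oldest->oldest_unflushed_wal) {
      oldest = &cf;
    }
  }
  return oldest;
}
```

With a fast-flushing CF A pinning file 98 and a slow CF B pinning file 40, everything from file 40 onward must be kept, and flushing B is what frees the backlog.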
Per-CF Options
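Nearly every LSM tuning knob lives in ColumnFamilyOptions and is therefore per-CF; DB-wide settings such as background job counts and WAL limits live in DBOptions.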
ColumnFamilyOptions cf_opts;
// Memory
cf_opts.write_buffer_size = 64 << 20; // 64 MB memtable
cf_opts.max_write_buffer_number = 4; // 4 memtables max in pipeline
cf_opts.min_write_buffer_number_to_merge = 2;
// Levels
cf_opts.compaction_style = kCompactionStyleLevel;
cf_opts.num_levels = 7;
cf_opts.target_file_size_base = 64 << 20;
cf_opts.max_bytes_for_level_base = 256 << 20;
cf_opts.max_bytes_for_level_multiplier = 10;
cf_opts.level0_file_num_compaction_trigger = 4;
// Bloom filter and table format
BlockBasedTableOptions table_opts;
table_opts.filter_policy.reset(NewBloomFilterPolicy(10, false));
table_opts.block_size = 4096;
table_opts.cache_index_and_filter_blocks = true;
cf_opts.table_factory.reset(NewBlockBasedTableFactory(table_opts));
// Compression
cf_opts.compression_per_level = {
    kNoCompression,      // L0: fastest writes
    kSnappyCompression,  // L1
    kSnappyCompression,  // L2
    kZSTDCompression,    // L3 and deeper: better compression
    kZSTDCompression,
    kZSTDCompression,
    kZSTDCompression
};
Per-level compression via compression_per_level suits CFs well: the upper levels, which
hold the most recently written data, stay uncompressed or use cheap Snappy for low-latency
reads, while the deeper, colder levels use ZSTD for storage density. And because CFs have
independent levels, two CFs in the same DB can have completely different compression
strategies.
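The level sizing options in the listing above imply a target size per level. The sketch below works out that arithmetic, assuming static level sizing (level_compaction_dynamic_level_bytes off); with dynamic sizing, which recent releases may enable by default, the targets are derived from the last level instead. LevelTargets is a hypothetical helper, not a RocksDB function.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Target size in bytes for levels L1..L(num_levels-1) under static sizing:
//   target(L1) = base, target(Ln) = target(Ln-1) * multiplier.
// L0 has no size target; it is governed by file-count triggers such as
// level0_file_num_compaction_trigger.
std::vector<uint64_t> LevelTargets(uint64_t base, uint64_t multiplier,
                                   int num_levels) {
  assert(num_levels >= 2);
  std::vector<uint64_t> targets;
  uint64_t size = base;
  for (int level = 1; level < num_levels; ++level) {
    targets.push_back(size);  // targets[level - 1] is the target for `level`
    size *= multiplier;
  }
  return targets;
}
```

For max_bytes_for_level_base = 256 MiB, a multiplier of 10, and num_levels = 7, this gives 256 MiB for L1, 2.5 GiB for L2, 25 GiB for L3, and so on up to L6.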
Atomic Flush
All-or-nothing flush across CFs to keep on-disk state consistent.
Without atomic flush, each CF's memtable flushes independently. If a transaction touched CFs A and B, and A flushes before a crash but B does not, recovery replays B's half from the WAL while A's data is already in an SST. The DB ends up consistent (sequence numbers ensure it), but during the flush window the SST files alone show A's writes and not B's. If those writes skipped the WAL, B's half is simply lost on a crash.
For systems that read SST files directly (incremental backups, snapshot exports), this
asymmetry can leak inconsistency. DBOptions::atomic_flush = true forces all
CFs to flush together: when any CF's memtable hits its limit, all CFs flush in one
synchronized step. Files written are visible only after every CF's new SST is committed to
MANIFEST.
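The difference in visibility can be illustrated with a deliberately simplified model in which a crash happens after a given number of MANIFEST commits. VisibleAfterCrash is a hypothetical function, and the assumption that independent flushes commit one CF at a time in list order is made only for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy model (not RocksDB code): which CFs have their new SST visible if the
// process crashes after `commits_before_crash` MANIFEST commits.
std::set<std::string> VisibleAfterCrash(const std::vector<std::string>& cfs,
                                        bool atomic_flush,
                                        std::size_t commits_before_crash) {
  std::set<std::string> visible;
  if (atomic_flush) {
    // One MANIFEST commit covers every CF: all or nothing.
    if (commits_before_crash >= 1) visible.insert(cfs.begin(), cfs.end());
    return visible;
  }
  // Independent flush: one MANIFEST commit per CF, in list order.
  for (std::size_t i = 0; i < cfs.size() && i < commits_before_crash; ++i) {
    visible.insert(cfs[i]);
  }
  return visible;
}
```

With independent flushes, a crash after one commit leaves only the first CF's SST visible; with atomic flush the result is either every CF or none.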
Common Patterns
How real systems use CFs.
TiKV keeps its key-value data in three main CFs: default for user
values, lock for transactional locks (small, hot, short-lived), and
write for MVCC commit records. In the classic layout these CFs belong to one RocksDB
instance per store and are shared by all Raft regions on it, rather than being created per
region. The lock CF uses small memtables and aggressive compaction (locks must be cleaned
up quickly); the default CF uses balanced settings; the write CF can tolerate higher read
amp because reads of historical commit records are rare.
MyRocks (MySQL on RocksDB) lets each index be placed in its own CF, chosen per index in the table definition. Each such CF can have its own bloom sizing: primary keys may get 16 bits/key for fast point lookups while secondary indexes accept 10. Per-CF compaction settings let hot indexes be tuned for more aggressive compaction than cold ones.
TTL data + permanent state: a common pattern is a default CF with leveled compaction for permanent state, plus one or more FIFO-compaction CFs for time-windowed streams. The FIFO CFs drop their oldest SST files once a size limit is reached, without rewriting any data. This is the LSM-level equivalent of capping a Redis stream with MAXLEN.
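The FIFO behavior can be sketched as a queue of files trimmed from the old end. This is a toy model with hypothetical names (SstFile, FifoTrim); in RocksDB the corresponding size limit is CompactionOptionsFIFO::max_table_files_size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Toy model (not RocksDB code): an SST file with an ID and a size.
struct SstFile {
  uint64_t id;
  uint64_t size_bytes;
};

// Files are ordered oldest-first. While the total size exceeds the limit,
// drop whole files from the old end. No data is read or rewritten.
// Returns the number of files dropped.
std::size_t FifoTrim(std::deque<SstFile>& files, uint64_t max_total_bytes) {
  uint64_t total = 0;
  for (const SstFile& f : files) total += f.size_bytes;
  std::size_t dropped = 0;
  while (!files.empty() && total > max_total_bytes) {
    total -= files.front().size_bytes;
    files.pop_front();
    ++dropped;
  }
  return dropped;
}
```

Dropping a file is a metadata operation, which is why this style has almost no write amplification; the trade-off is that data is removed strictly by age, regardless of whether individual keys are still wanted.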
FAQ
Why use column families instead of just multiple databases?
Column families share the WAL — a write that touches multiple CFs is atomic across all of them with one fsync. Multiple separate databases would each have their own WAL and you'd lose cross-database atomicity, plus you'd pay multiple fsyncs per write. Use CFs when you need different storage profiles (TTL data, indexes, hot keys vs cold blobs) within one logical database. Use separate databases when you genuinely have isolated state.
Do CFs share memtable memory?
Each CF has its own memtables, sized by write_buffer_size, and up to max_write_buffer_number of them can be in memory at once. Without a global limit, the worst-case memtable memory is therefore the sum across all CFs. Setting DBOptions::write_buffer_manager (or db_write_buffer_size) caps the total across CFs and triggers flushes when the shared budget is exceeded. Memtable memory is allocated as data arrives, so an idle CF with an empty memtable uses little RAM either way; the manager matters because it bounds the case where many CFs fill up at once. For deployments with many CFs, a shared budget is strongly recommended.
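The "sum across all CFs" budget can be worked out with a few lines of arithmetic. The sketch assumes the worst case in which every CF holds max_write_buffer_number full memtables at once; CfMemConfig and WorstCaseMemtableBytes are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Memtable-related settings of one CF (names mirror ColumnFamilyOptions).
struct CfMemConfig {
  uint64_t write_buffer_size;   // bytes per memtable
  int max_write_buffer_number;  // memtables that may coexist in memory
};

// Upper bound on memtable memory when no shared budget is configured:
// the sum over CFs of write_buffer_size * max_write_buffer_number.
uint64_t WorstCaseMemtableBytes(const std::vector<CfMemConfig>& cfs) {
  uint64_t total = 0;
  for (const CfMemConfig& cf : cfs) {
    assert(cf.max_write_buffer_number >= 1);
    total += cf.write_buffer_size *
             static_cast<uint64_t>(cf.max_write_buffer_number);
  }
  return total;
}
```

Three CFs with 64 MiB memtables and max_write_buffer_number = 4 can reach 768 MiB in the worst case, which is the number a shared budget is meant to cap.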
What's atomic flush?
If a transaction touches CFs A, B, and C, you may want either all three CFs' new data flushed together or none, to keep their SST files mutually consistent. atomic_flush=true makes RocksDB flush all CFs together when any of them hits its memtable limit, and commits the resulting SST files to the MANIFEST as one unit. Without it, a crash mid-flush can leave A's data in an SST while B's is only in the WAL. With the WAL enabled, recovery replays B's data and the database is still consistent; atomic flush matters when writes skip the WAL or when tools read SST files directly.
Can each CF have its own compaction style?
Yes. Per-CF options.compaction_style lets you mix leveled and universal compaction in the same DB. Useful pattern: a hot-data CF with universal compaction (low write amp), and a cold/secondary-index CF with leveled (low space amp). Per-CF target_file_size_base and write_buffer_size let you size each independently — small files for the hot CF (fast compactions, good for fresh data) and large for the cold (better compression, less metadata overhead).
How do CFs interact with snapshots?
A snapshot is database-wide: it captures a single sequence number that applies to all CFs. Reading from any CF at that snapshot gives you the state at that moment. This is what makes cross-CF consistent reads possible: a transaction's snapshot pins the same sequence number for all the CFs it might touch. The downside: a long-held snapshot holds back garbage collection in every CF, even ones the transaction never touches, because compaction must preserve every key version that a live snapshot can still read, so superseded values and tombstones pile up until the snapshot is released.
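A minimal sketch of a snapshot read, assuming each CF keeps every version of a key indexed by sequence number. CfData and GetAtSnapshot are hypothetical names; real RocksDB encodes the sequence number into the internal key instead of using nested maps.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Toy model (not RocksDB code): all versions of one key, by sequence number.
using Versions = std::map<uint64_t, std::string>;
// One column family: user key -> versions.
using CfData = std::map<std::string, Versions>;

// Read `key` as of `snapshot_seq`: the newest version whose sequence number
// is <= snapshot_seq. Returns false if the key did not exist at that point.
bool GetAtSnapshot(const CfData& cf, const std::string& key,
                   uint64_t snapshot_seq, std::string* value) {
  assert(value != nullptr);
  auto it = cf.find(key);
  if (it == cf.end()) return false;
  const Versions& versions = it->second;
  auto ver = versions.upper_bound(snapshot_seq);  // first version too new
  if (ver == versions.begin()) return false;      // nothing old enough
  --ver;
  *value = ver->second;
  return true;
}
```

Because the sequence space is global, passing the same snapshot_seq to reads on two different CFs yields a mutually consistent view. It also shows why old versions must be retained: the version at or below the snapshot cannot be discarded while the snapshot is live.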
Real-world examples?
TiKV uses three main CFs for its key-value data: 'default' for user data, 'lock' for transactional locks, 'write' for MVCC commit records. Facebook's MyRocks lets each index live in its own CF. CockroachDB is sometimes mentioned in this context, but its current storage engine, Pebble, does not support column families. Each CF can get its own bloom filter sizing, block cache, and compaction profile, so indexes can have aggressive read tuning while user data uses balanced tuning.