RocksDB Column Families

A column family (CF) in RocksDB is an independently configured key-value namespace within a single database. Each CF has its own memtables, SST files, Bloom filter settings, compaction style, and tuning options, but all CFs share one WAL, so a write batch that spans several CFs commits atomically with a single WAL append (and a single fsync when WriteOptions::sync is set). This makes CFs the natural way to separate hot data from cold, transactional state from indexes, or short-TTL streams from long-lived blobs while preserving cross-CF transactional semantics. This page walks through the architecture, the knobs, and the patterns that real users such as TiKV and MyRocks build on top.

Column Family Architecture

[Diagram: three column families with independent on-disk state, all sharing one WAL]
Shared WAL: sequence numbers monotonic across CFs, single fsync per commit
CF "default" (user data, leveled compaction): memtable 64 MB; L0–L4 SSTs, 10× multiplier; bloom 10 bits/key; balanced read/write
CF "indexes" (secondary indexes, read-tuned): memtable 32 MB; leveled, tighter levels; bloom 16 bits/key; read-heavy, low FP
CF "events" (TTL stream, FIFO compaction): memtable 256 MB; FIFO, drop oldest at 100 GB; no bloom (always full scan); sequential write only

Key Numbers

Default CF count: 1 ('default')
Max CFs: ~thousands OK
Per-CF memtable: independent
Shared WAL: all CFs
Atomic flush: opt-in
Per-CF bloom: independent
Snapshot scope: DB-wide

Why Column Families Exist

Heterogeneous workloads in one DB
Real systems have multiple data shapes: small KVs, large blobs, time-series, secondary indexes. Each wants different tuning. CFs let one DB host all of them with appropriate optimization per shape.
Atomic cross-CF writes
A transaction that updates user-data plus its index needs both updates atomic. Multiple databases would lose this. CFs give it via the shared WAL.
Independent compaction
A heavy compaction on the cold-blob CF doesn't block reads on the hot-data CF. Compactions are picked and scheduled per CF, although all CFs draw on the same DB-wide background job pool (max_background_jobs).

Creating and Using Column Families

The examples use the C++ API; the Rust, Java, and Go bindings follow the same shape.

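// Requires #include "rocksdb/db.h", <cassert>, <vector>, and using namespace rocksdb;

// Open with an explicit CF list. Every CF that exists in the DB must be listed;
// the named CFs must already exist unless db_opts.create_missing_column_families is set.
std::vector<ColumnFamilyDescriptor> cfs;
cfs.push_back(ColumnFamilyDescriptor(kDefaultColumnFamilyName, default_opts));
cfs.push_back(ColumnFamilyDescriptor("indexes", index_opts));
cfs.push_back(ColumnFamilyDescriptor("events", events_opts));

std::vector<ColumnFamilyHandle*> handles;  // filled in the same order as cfs
DB* db = nullptr;
Status s = DB::Open(db_opts, "/path/to/db", cfs, &handles, &db);
assert(s.ok());

// Write to a specific CF
s = db->Put(WriteOptions(), handles[1], key, value);
assert(s.ok());

// Atomic write across CFs
WriteBatch batch;
batch.Put(handles[0], "user_42", user_data);
batch.Put(handles[1], "name_alice", "user_42");
s = db->Write(WriteOptions(), &batch);   // both atomic via WAL
assert(s.ok());

// Release the handles before closing the DB
for (ColumnFamilyHandle* h : handles) db->DestroyColumnFamilyHandle(h);
delete db;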

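DB::Open fills the handles array in the same order as the descriptor list, and each handle carries its CF's numeric ID (ColumnFamilyHandle::GetID()). The WAL records every write with that CF ID so that recovery can replay each entry into the correct CF. Get, Delete, and NewIterator all take a CF handle as the argument right after the read or write options.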

The Shared WAL

One log, many memtables.

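Every Put or Delete is appended to the WAL as part of a write-batch record. Each entry in the record carries its CF ID, key, and value, and the record carries the batch's starting sequence number. Sequence numbers are monotonic across all CFs: there is one global sequence space. A WriteBatch receives a contiguous range of sequence numbers and is written as a single WAL record, so it becomes durable as a unit (at fsync time when synced writes are enabled).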
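The global sequence space and CF-tagged entries can be sketched with a small in-memory model. This is a toy illustration, not RocksDB code: ToyWal, WalEntry, and WalRecord are hypothetical names, and the model assumes one sequence number per entry in a batch.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One logical write inside a batch: the target CF plus the key and value.
struct WalEntry {
  uint32_t cf_id;
  std::string key;
  std::string value;
};

// A WAL record holds a whole batch and the sequence number of its first entry.
struct WalRecord {
  uint64_t first_seq;
  std::vector<WalEntry> entries;
};

// Toy write-ahead log shared by every column family (hypothetical model).
class ToyWal {
 public:
  // Appends the batch as ONE record and returns the first sequence number used.
  uint64_t Append(const std::vector<WalEntry>& batch) {
    assert(!batch.empty());
    const uint64_t first = next_seq_;
    records_.push_back(WalRecord{first, batch});
    next_seq_ += batch.size();  // contiguous range, one number per entry
    return first;
  }

  // Replays the log, routing each entry to the memtable of its CF.
  std::map<uint32_t, std::map<std::string, std::string>> Replay() const {
    std::map<uint32_t, std::map<std::string, std::string>> memtables;
    for (const WalRecord& rec : records_) {
      for (const WalEntry& e : rec.entries) {
        memtables[e.cf_id][e.key] = e.value;
      }
    }
    return memtables;
  }

 private:
  uint64_t next_seq_ = 1;
  std::vector<WalRecord> records_;
};
```

Appending a two-entry batch for CFs 0 and 1 followed by a one-entry batch for CF 2 yields starting sequence numbers 1 and 3, and Replay() rebuilds three separate memtables from the single log.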

A consequence: WAL retention is bounded by the slowest-flushing CF. A WAL file cannot be deleted until every CF with data in it has flushed that data. If CF A flushes every 10 s but CF B every 10 min, the WAL holds about 10 min of writes, sized to B's cadence. max_total_wal_size bounds this: when the total WAL size exceeds it, RocksDB triggers flushes on the CFs whose memtables pin the oldest WAL files. The default of 0 does not mean unlimited; RocksDB then derives the limit as 4 × the sum of write_buffer_size × max_write_buffer_number over all CFs.
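The retention rule boils down to a minimum over the CFs. A minimal sketch, assuming each CF tracks the lowest WAL file number it still needs (CfLogState and OldestWalToKeep are hypothetical names for illustration, not RocksDB API):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-CF bookkeeping: the lowest WAL file number that still
// holds writes this CF has not yet flushed to an SST.
struct CfLogState {
  std::string name;
  uint64_t min_wal_needed;
};

// WAL files numbered below the returned value are safe to delete.
uint64_t OldestWalToKeep(const std::vector<CfLogState>& cfs) {
  assert(!cfs.empty());
  uint64_t oldest = cfs.front().min_wal_needed;
  for (const CfLogState& cf : cfs) {
    oldest = std::min(oldest, cf.min_wal_needed);
  }
  return oldest;
}
```

With CFs needing WAL files 17, 5, and 12, everything from file 5 onward must stay. Once the slow CF flushes and only needs file 18, retention jumps forward to file 12.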

Per-CF Options

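Nearly every LSM tuning knob is per-CF: memtable sizing, level shape, table format, and compression all live in ColumnFamilyOptions, while threading and WAL settings stay DB-wide in DBOptions.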

// Requires rocksdb/options.h, rocksdb/table.h, and rocksdb/filter_policy.h
ColumnFamilyOptions cf_opts;

// Memory
cf_opts.write_buffer_size = 64 << 20;     // 64 MB memtable
cf_opts.max_write_buffer_number = 4;       // 4 memtables max in pipeline
cf_opts.min_write_buffer_number_to_merge = 2;

// Levels
cf_opts.compaction_style = kCompactionStyleLevel;
cf_opts.num_levels = 7;
cf_opts.target_file_size_base = 64 << 20;
cf_opts.max_bytes_for_level_base = 256 << 20;
cf_opts.max_bytes_for_level_multiplier = 10;
cf_opts.level0_file_num_compaction_trigger = 4;

// Bloom filter and table format
BlockBasedTableOptions table_opts;
table_opts.filter_policy.reset(NewBloomFilterPolicy(10, false));
table_opts.block_size = 4096;
table_opts.cache_index_and_filter_blocks = true;
cf_opts.table_factory.reset(NewBlockBasedTableFactory(table_opts));

// Compression
cf_opts.compression_per_level = {
  kNoCompression,    // L0 fast
  kSnappyCompression,
  kSnappyCompression,
  kZSTDCompression,  // deeper levels: better compression
  kZSTDCompression,
  kZSTDCompression,
  kZSTDCompression
};

Setting compression_per_level per CF is particularly useful: the upper, hotter levels stay uncompressed or lightly compressed (Snappy) for low-latency reads, while the deeper, colder levels use ZSTD for storage density. And because CFs have independent levels, two CFs in the same DB can use completely different compression strategies.
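To see what the level settings above imply for one CF, the target size of each level can be computed directly. This sketch assumes static level sizing (level_compaction_dynamic_level_bytes disabled) and an integer multiplier; LevelTargets is a hypothetical helper, not a RocksDB function.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Target size in bytes for L1..L(num_levels-1) under static leveled sizing:
// L1 gets `base`, and each deeper level is `multiplier` times the previous one.
std::vector<uint64_t> LevelTargets(uint64_t base, uint64_t multiplier,
                                   int num_levels) {
  assert(num_levels >= 2 && multiplier >= 1);
  std::vector<uint64_t> targets;
  uint64_t size = base;
  for (int level = 1; level < num_levels; ++level) {
    targets.push_back(size);
    size *= multiplier;
  }
  return targets;
}
```

For max_bytes_for_level_base = 256 MB, a multiplier of 10, and num_levels = 7, this gives six targets starting at 256 MB for L1 and 2560 MB for L2. A second CF with a different base or multiplier gets an entirely different level shape.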

Atomic Flush

All-or-nothing flush across CFs to keep on-disk state consistent.

Without atomic flush, each CF's memtable flushes independently. Suppose a transaction touched CFs A and B, and A flushes before a crash but B does not. On recovery RocksDB replays B's writes from the WAL, so the DB ends up consistent as long as the WAL is enabled. Until that replay, though, the SST files on disk contain A's writes and not B's.

For systems that read SST files directly (incremental backups, snapshot exports) or that run with the WAL disabled, this asymmetry can leak inconsistency. Setting DBOptions::atomic_flush = true makes RocksDB flush all CFs together: when any CF's memtable hits its limit, all CFs flush in one synchronized step. Files written are visible only after every CF's new SST is committed to MANIFEST.
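The difference between the two flush modes can be illustrated with a toy model of what a tool reading SST files would see. ToyFlushDb and its methods are hypothetical and ignore the WAL and MANIFEST entirely; the sketch only shows which CFs reach disk together.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model: per-CF memtables and per-CF on-disk key lists (hypothetical).
class ToyFlushDb {
 public:
  explicit ToyFlushDb(bool atomic_flush) : atomic_flush_(atomic_flush) {}

  void Put(const std::string& cf, const std::string& key) {
    memtable_[cf].push_back(key);
  }

  // Called when the memtable of `full_cf` reaches its size limit.
  void OnMemtableFull(const std::string& full_cf) {
    assert(memtable_.count(full_cf) == 1);
    if (atomic_flush_) {
      // Atomic mode: every CF flushes in the same step.
      for (auto& kv : memtable_) FlushOne(kv.first);
    } else {
      // Independent mode: only the full CF flushes.
      FlushOne(full_cf);
    }
  }

  // Keys that a tool reading SST files directly would see for `cf`.
  std::vector<std::string> OnDisk(const std::string& cf) const {
    auto it = sst_.find(cf);
    if (it == sst_.end()) return std::vector<std::string>();
    return it->second;
  }

 private:
  void FlushOne(const std::string& cf) {
    std::vector<std::string>& pending = memtable_[cf];
    std::vector<std::string>& files = sst_[cf];
    files.insert(files.end(), pending.begin(), pending.end());
    pending.clear();
  }

  bool atomic_flush_;
  std::map<std::string, std::vector<std::string>> memtable_;
  std::map<std::string, std::vector<std::string>> sst_;
};
```

With atomic_flush off, writing to A and B and then filling A's memtable leaves only A's key on disk. With it on, the same sequence puts both keys on disk in one step.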

Common Patterns

How real systems use CFs.

TiKV's key-value RocksDB instance uses three main CFs: default for user-visible data, lock for transactional row locks (small, hot, short-lived), and write for MVCC commit records. The lock CF uses small memtables and aggressive compaction because locks must be cleaned up quickly; the default CF uses balanced settings; the write CF can tolerate higher read amplification because reads of historical commit records are rare.

MyRocks (MySQL on RocksDB) lets each index be assigned to its own CF. Each CF can then have its own Bloom sizing: a CF holding primary keys might use 16 bits/key for fast point lookups while one holding secondary indexes accepts 10. Per-CF compaction settings let hot indexes be tuned separately from cold ones.
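To put rough numbers on the bits-per-key tradeoff, the textbook Bloom filter estimate can be evaluated directly. This is the idealized formula with the optimal probe count; RocksDB's actual filter implementations will not match it exactly, and BloomFalsePositiveRate is a hypothetical helper.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Textbook false-positive rate of a Bloom filter with `bits_per_key` bits per
// key and the optimal number of hash probes k = round(bits_per_key * ln 2).
double BloomFalsePositiveRate(double bits_per_key) {
  assert(bits_per_key > 0.0);
  const double k = std::max(1.0, std::round(bits_per_key * std::log(2.0)));
  return std::pow(1.0 - std::exp(-k / bits_per_key), k);
}
```

Under this idealized model, 10 bits/key gives a false-positive rate of roughly 0.8%, and 16 bits/key brings it down to roughly 0.05%, at the cost of 60% more filter memory per key.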

TTL data + permanent state: a common pattern is a default CF with leveled compaction for permanent state, plus one or more FIFO-compaction CFs for time-windowed streams. The FIFO CFs drop their oldest SST files once a size or age limit is reached, with no merge compaction; this is the LSM-level equivalent of capping a Redis stream with MAXLEN.
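The size-based FIFO behavior can be modeled in a few lines. ToyFifoCf is a hypothetical stand-in for a FIFO-compacted CF: it tracks only file sizes and assumes the newest file is always kept, even if it alone exceeds the budget.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Toy FIFO column family: files are kept oldest-first and dropped whole.
class ToyFifoCf {
 public:
  explicit ToyFifoCf(uint64_t max_total_bytes)
      : max_total_bytes_(max_total_bytes) {
    assert(max_total_bytes > 0);
  }

  // Registers a newly flushed file, then drops the oldest files until the
  // total fits the budget again. Returns how many files were dropped.
  int AddFile(uint64_t bytes) {
    files_.push_back(bytes);
    total_bytes_ += bytes;
    int dropped = 0;
    while (total_bytes_ > max_total_bytes_ && files_.size() > 1) {
      total_bytes_ -= files_.front();
      files_.pop_front();
      ++dropped;
    }
    return dropped;
  }

  uint64_t total_bytes() const { return total_bytes_; }
  std::size_t file_count() const { return files_.size(); }

 private:
  uint64_t max_total_bytes_;
  uint64_t total_bytes_ = 0;
  std::deque<uint64_t> files_;  // front = oldest file
};
```

No keys are rewritten or merged here: dropping data is just deleting the oldest file, which is why this style suits append-only streams.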

FAQ

Why use column families instead of just multiple databases?

Column families share the WAL: a write that touches multiple CFs is atomic across all of them and costs one WAL write (one fsync for a synced write). Multiple separate databases would each have their own WAL, so you would lose cross-database atomicity and pay one WAL write, and possibly one fsync, per database. Use CFs when you need different storage profiles (TTL data, indexes, hot keys vs cold blobs) within one logical database. Use separate databases when you genuinely have isolated state.

Do CFs share memtable memory?

Each CF has its own memtables, sized by its own write_buffer_size. Without further configuration, the worst-case memtable footprint is the sum over all CFs of write_buffer_size × max_write_buffer_number. To cap the total, set DBOptions::write_buffer_manager (or db_write_buffer_size), which pools one budget across every CF and triggers flushes when the pool fills, so a busy CF can use memory that idle CFs are not using. For deployments with many CFs, a shared write buffer manager is the practical way to keep memtable memory bounded.
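The budget arithmetic is easy to check by hand. A minimal sketch, assuming the three example CFs from the architecture diagram each allow four memtables; CfMemtableConfig and both helpers are hypothetical names, and the shared limit is treated as a simple cap even though a real write buffer manager enforces it by triggering flushes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical summary of the memtable settings of one CF.
struct CfMemtableConfig {
  std::string name;
  uint64_t write_buffer_size;   // bytes per memtable
  int max_write_buffer_number;  // memtables allowed per CF
};

// Worst case without a shared budget: every CF fills all of its memtables.
uint64_t WorstCaseMemtableBytes(const std::vector<CfMemtableConfig>& cfs) {
  uint64_t total = 0;
  for (const CfMemtableConfig& cf : cfs) {
    assert(cf.max_write_buffer_number >= 1);
    total += cf.write_buffer_size *
             static_cast<uint64_t>(cf.max_write_buffer_number);
  }
  return total;
}

// With a shared limit (0 means "no shared limit"), the pool caps the total.
uint64_t EffectiveMemtableBudget(const std::vector<CfMemtableConfig>& cfs,
                                 uint64_t shared_limit) {
  const uint64_t worst = WorstCaseMemtableBytes(cfs);
  if (shared_limit == 0) return worst;
  return std::min(worst, shared_limit);
}
```

With memtables of 64 MB, 32 MB, and 256 MB and four memtables each, the worst case is 1408 MB; a 512 MB shared limit brings the bound down to 512 MB regardless of how many CFs exist.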

What's atomic flush?

If a transaction touches CFs A, B, and C, you may want either all three CFs' new data flushed together or none, so that their SST files stay mutually consistent. atomic_flush=true makes RocksDB flush all CFs together when any of them hits its memtable limit. Without it, a crash mid-flush can leave A's data in an SST while B's exists only in the WAL. WAL replay repairs that on recovery, so atomic flush matters mainly when the WAL is disabled or when other tools read the SST files directly.

Can each CF have its own compaction style?

Yes. Per-CF options.compaction_style lets you mix leveled and universal compaction in the same DB. Useful pattern: a hot-data CF with universal compaction (low write amp), and a cold/secondary-index CF with leveled (low space amp). Per-CF target_file_size_base and write_buffer_size let you size each independently — small files for the hot CF (fast compactions, good for fresh data) and large for the cold (better compression, less metadata overhead).

How do CFs interact with snapshots?

A snapshot is database-wide: it captures one sequence number that applies to all CFs. Reading from any CF at that snapshot gives you the state at that moment. This is what makes cross-CF consistent reads possible: a transaction's snapshot pins the same sequence number for all the CFs it might touch. The downside: a long-held snapshot holds back garbage collection in every CF, even ones the transaction never touches. Compaction still runs, but it cannot discard overwritten values or tombstones that the snapshot can still see.
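A snapshot read can be pictured as "latest version at or below the pinned sequence number", applied identically in every CF. The following toy multi-version store illustrates that rule; ToySnapshotDb is a hypothetical model and does not reflect RocksDB's internal key layout.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Toy multi-version store: one global sequence counter, versions kept per CF.
class ToySnapshotDb {
 public:
  void Put(uint32_t cf_id, const std::string& key, const std::string& value) {
    versions_[cf_id][key][++last_seq_] = value;
  }

  // A snapshot is just the latest sequence number, valid for every CF.
  uint64_t GetSnapshot() const { return last_seq_; }

  // Returns the newest version of `key` in `cf_id` with seq <= snapshot_seq.
  bool Get(uint32_t cf_id, const std::string& key, uint64_t snapshot_seq,
           std::string* value) const {
    assert(value != nullptr);
    auto cf_it = versions_.find(cf_id);
    if (cf_it == versions_.end()) return false;
    auto key_it = cf_it->second.find(key);
    if (key_it == cf_it->second.end()) return false;
    const std::map<uint64_t, std::string>& by_seq = key_it->second;
    auto newer = by_seq.upper_bound(snapshot_seq);  // first version too new
    if (newer == by_seq.begin()) return false;      // nothing visible yet
    *value = std::prev(newer)->second;
    return true;
  }

 private:
  uint64_t last_seq_ = 0;
  std::map<uint32_t, std::map<std::string, std::map<uint64_t, std::string>>>
      versions_;
};
```

The model also shows the cost described above: the old version of a key must stay in the map for as long as any snapshot at or after its sequence number might read it.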

Real-world examples?

TiKV uses three main CFs in its key-value engine: 'default' for user data, 'lock' for transactional locks, 'write' for MVCC commit records. Facebook's MyRocks lets each index be placed in its own CF. Each CF gets its own Bloom filter sizing and compaction profile, and can share a block cache or use its own, so indexes can have aggressive read tuning while user data uses balanced tuning.