RocksDB Transactions

RocksDB ships two transaction implementations: TransactionDB (pessimistic, with lock acquisition on writes and explicit GetForUpdate for reads that precede writes) and OptimisticTransactionDB (no locks; conflicts detected at commit by checking sequence numbers). Pessimistic mode further splits by write policy into WriteCommitted (buffers writes until commit, the default) and WritePrepared (inserts into the memtable at Prepare, developed for MyRocks); a third policy, WriteUnprepared, can flush writes even before Prepare. All flavors integrate with snapshots and column families; range deletes are not exposed through the Transaction API. This page explains the protocols, the tradeoffs, and the practical guidance for picking among them.

Transaction Mode Comparison

Key Numbers

Default mode: WriteCommitted
Lock timeout: 1000 ms default
Deadlock depth: 50 default
Snapshot pin: blocks GC
2PC support: via Prepare
Range delete: not in the Transaction API
Cross-CF txn: supported

Why Multiple Transaction Modes

Different contention profiles

High contention (many txns racing on same keys) wants pessimistic — locks serialize predictably. Low contention wants optimistic — most txns commit on first try, no lock overhead.

Different memory budgets

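A txn with 1M writes blows up WriteCommitted's in-memory write batch, and because the memtable insert happens at Commit, commit latency grows with txn size. WritePrepared moves the memtable insert to Prepare, so Commit is cheap. WriteUnprepared goes further and flushes writes before Prepare, which is what bounds memory. In both cases readers must filter out data that is not yet committed.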

Different consistency needs

Systems that run a commit protocol above RocksDB, across nodes or across engines (MyRocks coordinating with the MySQL binlog is the canonical case), need RocksDB to expose primitives that play nice with that external logic. Prepare/Commit (2PC) is the bridge, and WritePrepared keeps the commit step cheap.

TransactionDB (Pessimistic, WriteCommitted)

The default. Writes take locks as they are issued, so conflicts surface at the Put (as a wait or an error) and Commit needs no separate validation pass.

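#include <rocksdb/utilities/transaction.h>
#include <rocksdb/utilities/transaction_db.h>

// Assumes an Options db_opts with create_if_missing = true.
TransactionDB* tdb = nullptr;
TransactionDBOptions txn_opts;
Status s = TransactionDB::Open(db_opts, txn_opts, "/path", &tdb);

ReadOptions read_opts;
std::string value;                                  // Get/GetForUpdate write the result here
Transaction* txn = tdb->BeginTransaction(WriteOptions());

s = txn->Put("key1", "v1");                         // acquires lock on key1
s = txn->Get(read_opts, "key2", &value);            // no lock (plain read)
s = txn->GetForUpdate(read_opts, "key3", &value);   // acquires lock on key3
s = txn->Put("key3", "v3");                         // check s after each call in real code

s = txn->Commit();
delete txn;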

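Each Put acquires an exclusive row lock; the lock is held until commit or rollback. If another transaction holds the lock, the calling txn waits up to lock_timeout (by default the DB-wide transaction_lock_timeout of 1000 ms) and then fails with Status::TimedOut(). If deadlock detection is enabled (TransactionOptions::deadlock_detect, off by default) and a wait would create a deadlock, the lock request fails with Status::Busy() instead of waiting.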

GetForUpdate is the read-with-intent-to-write primitive. It acquires the same kind of lock as Put would, preventing a race between read and write within the same txn. Without it, you could read a value, decide to update it, and have another txn write the same key in between; your update then overwrites their write, which is a lost-update bug.
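The locking behavior described above can be sketched with a toy lock table. This is a minimal illustration, not RocksDB's lock manager: ToyLockTable, LockResult, and the integer txn ids are hypothetical names, and the sketch omits waiting, timeouts, and shared locks. It shows only that a lock taken by a Put or a GetForUpdate keeps other txns off the key until the owner commits or rolls back.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical result type: kBusy means another txn owns the lock.
enum class LockResult { kOk, kBusy };

// Toy exclusive lock table: key -> id of the owning transaction.
class ToyLockTable {
 public:
  // Grants the lock if the key is free or already owned by txn_id.
  LockResult TryLock(int txn_id, const std::string& key) {
    auto it = owners_.find(key);
    if (it == owners_.end()) {
      owners_[key] = txn_id;
      return LockResult::kOk;
    }
    return it->second == txn_id ? LockResult::kOk : LockResult::kBusy;
  }

  // Called on commit or rollback: releases every lock held by txn_id.
  void UnlockAll(int txn_id) {
    for (auto it = owners_.begin(); it != owners_.end();) {
      if (it->second == txn_id) {
        it = owners_.erase(it);
      } else {
        ++it;
      }
    }
  }

 private:
  std::unordered_map<std::string, int> owners_;
};
```

If txn 1 locks "key3" through a read-for-update, txn 2's attempt on the same key reports kBusy until txn 1 calls UnlockAll, which is the window in which the lost update would otherwise happen.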

Internally, all writes are buffered in the txn's WriteBatch in memory. At commit, the whole batch is appended to the WAL and applied to the memtable atomically. If the txn was large (millions of writes), this means GBs of write batch held in RAM until commit.

OptimisticTransactionDB

No locks. Conflict detection at commit by snapshot comparison.

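#include <rocksdb/utilities/optimistic_transaction_db.h>
#include <rocksdb/utilities/transaction.h>

// Assumes an Options db_opts with create_if_missing = true.
OptimisticTransactionDB* odb = nullptr;
Status s = OptimisticTransactionDB::Open(db_opts, "/path", &odb);

OptimisticTransactionOptions otxn_opts;
otxn_opts.set_snapshot = true;               // take a snapshot at BeginTransaction
Transaction* txn = odb->BeginTransaction(WriteOptions(), otxn_opts);

ReadOptions read_opts;
read_opts.snapshot = txn->GetSnapshot();     // without this, Get reads the latest committed data
std::string value;

txn->Put("key1", "v1");                      // no lock, just buffered
txn->Get(read_opts, "key2", &value);         // reads at txn start snapshot

s = txn->Commit();
if (s.IsBusy()) {
  // conflict detected at commit; retry the txn
}
delete txn;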

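OCC works by snapshotting at txn start (with set_snapshot = true), recording every key the txn writes or reads through GetForUpdate, and checking at commit time whether any of those keys has been modified since the snapshot. If yes, Commit returns Status::Busy() and nothing is written. If no, the txn commits by appending its batch to the WAL and applying it.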
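The commit-time check can be sketched with a toy in-memory store. This is a simplified model under stated assumptions, not RocksDB's implementation: ToyStore and ToyOptimisticTxn are hypothetical names, the store keeps only the sequence number of the last write per key, and reads are not tracked.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Toy store: key -> (sequence of last write, value). Hypothetical type.
struct ToyStore {
  uint64_t latest_seq = 0;
  std::map<std::string, std::pair<uint64_t, std::string>> data;
};

struct ToyOptimisticTxn {
  uint64_t snapshot_seq;                      // sequence number at txn start
  std::map<std::string, std::string> writes;  // buffered, not yet visible

  explicit ToyOptimisticTxn(const ToyStore& s) : snapshot_seq(s.latest_seq) {}

  void Put(const std::string& k, const std::string& v) { writes[k] = v; }

  // Returns true on commit, false on conflict (the analogue of Status::Busy()).
  bool Commit(ToyStore& store) {
    // Validation: abort if any written key changed after our snapshot.
    for (const auto& kv : writes) {
      auto it = store.data.find(kv.first);
      if (it != store.data.end() && it->second.first > snapshot_seq) {
        return false;
      }
    }
    // No conflict: apply the buffered writes with fresh sequence numbers.
    for (const auto& kv : writes) {
      store.data[kv.first] = std::make_pair(++store.latest_seq, kv.second);
    }
    return true;
  }
};
```

Two txns that start from the same snapshot and both write "key1" illustrate the race: the first Commit succeeds, the second fails validation, and a retry that starts from a fresh snapshot goes through.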

Since there are no locks, multiple txns can race; only the first to commit succeeds, the others retry. Throughput is excellent under low contention (no lock overhead at all). Under high contention, the same txn may retry many times — sometimes worse than pessimistic. Tune by application: detect when retry rate is high and switch to pessimistic.
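A bounded retry loop that also reports how many attempts were needed is one way to measure the retry rate mentioned above. The sketch below is an assumption about how an application might structure this; ToyStatus, RetryResult, and RetryOnBusy are hypothetical names, and the callback stands in for "begin txn, do work, Commit".

```cpp
#include <cassert>
#include <functional>

// Hypothetical status type standing in for Status::OK() / Status::Busy().
enum class ToyStatus { kOk, kBusy };

struct RetryResult {
  ToyStatus status;  // outcome of the last attempt
  int attempts;      // how many times the callback ran
};

// Runs `attempt` until it stops reporting kBusy or max_attempts is reached.
RetryResult RetryOnBusy(const std::function<ToyStatus()>& attempt,
                        int max_attempts) {
  RetryResult result{ToyStatus::kBusy, 0};
  while (result.attempts < max_attempts) {
    ++result.attempts;
    result.status = attempt();
    if (result.status != ToyStatus::kBusy) break;
  }
  return result;
}
```

Feeding the attempts count into a metric gives the signal for deciding when a workload has too much contention for optimistic mode.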

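WritePrepared (Pessimistic, Memtable Writes at Prepare)

Writes go to the WAL/memtable at Prepare; commit just marks them visible.

WriteCommitted holds every write in memory until Commit and only then inserts into the memtable, so a large txn pays both the memory and a long commit. WritePrepared still buffers writes while the txn runs, but at Prepare it appends the batch to the WAL and inserts it into the memtable, tagged with the txn's prepare sequence number. Commit then only records a commit marker. Other readers encounter those entries but filter them out; only entries from committed transactions are visible. The related WriteUnprepared policy flushes writes to the memtable before Prepare, which is what bounds memory for very large txns (the 100M-write case).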

Reading under WritePrepared relies on a "commit cache": a map from prepare_seq to commit_seq for recently committed txns. A reader at snapshot S sees an entry written at prepare sequence P if and only if P's transaction has committed with some commit_seq C and C ≤ S. Filtering happens in the iterator/Get path; reads are roughly 5% slower in exchange for cheaper commits.
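The visibility rule is: an entry tagged with a prepare sequence number is visible to a snapshot only if its txn committed at or before that snapshot. It can be sketched as a lookup. ToyCommitCache is a hypothetical name and an unbounded map; the real commit cache is a bounded structure with eviction rules that this sketch ignores.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Toy commit cache: prepare_seq -> commit_seq for committed transactions.
class ToyCommitCache {
 public:
  // Records that the txn prepared at prepare_seq committed at commit_seq.
  void MarkCommitted(uint64_t prepare_seq, uint64_t commit_seq) {
    commit_of_[prepare_seq] = commit_seq;
  }

  // True only if the txn has committed and did so at or before snapshot_seq.
  bool IsVisible(uint64_t prepare_seq, uint64_t snapshot_seq) const {
    auto it = commit_of_.find(prepare_seq);
    return it != commit_of_.end() && it->second <= snapshot_seq;
  }

 private:
  std::unordered_map<uint64_t, uint64_t> commit_of_;
};
```

An entry prepared at sequence 10 and committed at 20 is hidden from a snapshot at 15 and visible to snapshots at 20 or later; an entry with no commit record is hidden from everyone.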

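WritePrepared was built for MyRocks, where MySQL runs 2PC between its binlog and the storage engine: the prepare phase writes the txn's data into the memtable as not-yet-committed records, and the commit phase atomically marks them committed. Percolator-style systems such as TiKV follow a similar intent-then-commit pattern, but implement it in their own MVCC layer above RocksDB, not through this API. Under WriteCommitted, Prepare only logs the batch to the WAL and the memtable insert waits for Commit, so commit latency grows with txn size.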

2PC: Prepare and Commit

Two-phase commit primitives for distributed transactions.

txn->SetName("txn-uuid-1234");          // required before Prepare; give it a stable identifier
txn->Put(...);
Status s = txn->Prepare();              // write prepare record to WAL

// some other coordination happens; txn is durable but not yet committed

if (s.ok() && other_nodes_say_yes) {
  s = txn->Commit();                    // mark committed
} else {
  s = txn->Rollback();
}
delete txn;

Prepare writes a "prepare" log record durably; if the process crashes between Prepare and Commit, on recovery the txn is left in prepared state and the application can look it up (TransactionDB::GetAllPreparedTransactions or GetTransactionByName) and either Commit or Rollback it. This is the building block for distributed 2PC: the coordinator gets all participants to Prepare, then issues Commit once every one of them has prepared; if any participant fails to Prepare, the coordinator issues Rollback to all. Note that 2PC needs unanimity, not a quorum.
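The coordinator logic can be sketched as follows. This is an illustrative model only: ToyParticipant, ToyTxnState, and RunTwoPhaseCommit are hypothetical names, participants are in-process objects, and the sketch ignores crashes, timeouts, and durable coordinator state, all of which a real coordinator has to handle.

```cpp
#include <cassert>
#include <vector>

enum class ToyTxnState { kActive, kPrepared, kCommitted, kRolledBack };

// Stand-in for one node's local transaction.
struct ToyParticipant {
  bool can_prepare;  // whether this participant's Prepare will succeed
  ToyTxnState state;

  explicit ToyParticipant(bool ok)
      : can_prepare(ok), state(ToyTxnState::kActive) {}

  bool Prepare() {
    if (!can_prepare) return false;
    state = ToyTxnState::kPrepared;
    return true;
  }
  void Commit() { state = ToyTxnState::kCommitted; }
  void Rollback() { state = ToyTxnState::kRolledBack; }
};

// Phase 1: ask everyone to prepare. Phase 2: commit only if all prepared,
// otherwise roll back every participant. Returns true if committed.
bool RunTwoPhaseCommit(std::vector<ToyParticipant>& participants) {
  bool all_prepared = true;
  for (auto& p : participants) {
    if (!p.Prepare()) {
      all_prepared = false;
      break;
    }
  }
  for (auto& p : participants) {
    if (all_prepared) {
      p.Commit();
    } else {
      p.Rollback();
    }
  }
  return all_prepared;
}
```

A single failed Prepare is enough to roll back participants that had already prepared, which is why Prepare must leave the txn in a state that can still go either way.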

Snapshots and Long Transactions

Why holding a transaction open is costly beyond memory.

A transaction's snapshot pins a sequence number. As long as the txn is alive (not committed or rolled back), compaction cannot drop any version or tombstone that a read at that sequence number could still observe, even after newer writes supersede it. Long-running txns therefore cause space amplification.

Worse: snapshots are DB-wide. If a txn took a snapshot at sequence 1000 and is still alive, every compaction across all CFs has to keep, for each key overwritten or deleted since then, the version that was current at sequence 1000, even though it is logically dead for everyone else. Long txns × many writes × many CFs = lots of unrecyclable disk.
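The retention effect can be sketched per key: the newest version always survives, and so does the version each live snapshot would read. VersionsToKeep is a hypothetical helper written for illustration; real compaction also has to reason about tombstones, merge operands, and levels, which this sketch leaves out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Given the sequence numbers of all versions of ONE key and the sequence
// numbers of live snapshots, returns (ascending) the versions that must
// survive compaction in this simplified model.
std::vector<uint64_t> VersionsToKeep(std::vector<uint64_t> versions,
                                     const std::vector<uint64_t>& snapshots) {
  std::sort(versions.begin(), versions.end());
  std::set<uint64_t> keep;
  if (!versions.empty()) keep.insert(versions.back());  // latest is always live
  for (uint64_t snap : snapshots) {
    // The newest version with sequence <= snap is what a reader at snap sees.
    auto it = std::upper_bound(versions.begin(), versions.end(), snap);
    if (it != versions.begin()) keep.insert(*(it - 1));
  }
  return std::vector<uint64_t>(keep.begin(), keep.end());
}
```

With versions at 900, 1200, 1500, and 2000 and a snapshot at 1000, the model keeps 900 and 2000; once the snapshot is released only 2000 needs to stay.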

Best practice: keep transactions short-lived. If you need to hold state across longer windows (a multi-minute analytical scan), do it without a transaction or accept the space cost explicitly. Monitor rocksdb.oldest-snapshot-sequence and alert if it lags too far behind the latest sequence.

FAQ

Pessimistic vs optimistic — which should I use?

Pessimistic (TransactionDB) acquires locks on each Put/Delete and blocks conflicting transactions. Best when contention is high and conflict resolution must be deterministic. Optimistic (OptimisticTransactionDB) skips locking entirely; conflicts are detected at Commit time by checking sequence numbers. If a conflicting write happened, Commit fails and you retry. Best when contention is low: most txns succeed without coordination overhead. MyRocks, for example, runs on the pessimistic TransactionDB for predictability.

What's the difference between WriteCommitted and WritePrepared?

Both are pessimistic flavors. WriteCommitted (default) buffers writes in memory until Commit, then writes them all to the WAL and memtable atomically. Drawback: memory usage scales with the active txn's write set, and Commit does all the memtable work. WritePrepared writes the txn's data to the WAL/memtable at Prepare, with a commit marker added at Commit time; readers filter out uncommitted data using the commit cache and their snapshot. Cheaper commits, higher complexity. WritePrepared was built for MyRocks' 2PC between the MySQL binlog and the storage engine.

How do snapshots interact with transactions?

A transaction can take a snapshot (txn->SetSnapshot(), or set_snapshot = true in the transaction options); this pins a sequence number. Reads that pass txn->GetSnapshot() in ReadOptions::snapshot see the txn's own writes plus committed writes ≤ that number, so they are repeatable within the transaction. Without a snapshot, each Get sees the latest committed state, which can drift between Gets. Snapshots also block compaction from dropping versions and tombstones they can still observe, so long-held snapshots cause space amp.
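The difference between a snapshot read and a latest-state read can be sketched with a toy multi-version store. ToyMvcc is a hypothetical class for illustration; it keeps every version in memory and returns an empty string for a missing key, neither of which reflects how RocksDB stores or reports data.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Toy multi-version store: key -> (sequence -> value).
class ToyMvcc {
 public:
  // Writes a new version and returns its sequence number.
  uint64_t Put(const std::string& key, const std::string& value) {
    versions_[key][++latest_seq_] = value;
    return latest_seq_;
  }

  uint64_t LatestSeq() const { return latest_seq_; }

  // Returns the newest value with sequence <= snapshot_seq, or "" if none.
  std::string Get(const std::string& key, uint64_t snapshot_seq) const {
    auto k = versions_.find(key);
    if (k == versions_.end()) return "";
    auto it = k->second.upper_bound(snapshot_seq);
    if (it == k->second.begin()) return "";
    return std::prev(it)->second;
  }

 private:
  uint64_t latest_seq_ = 0;
  std::map<std::string, std::map<uint64_t, std::string>> versions_;
};
```

A reader that keeps using the sequence number it captured earlier gets the same value on every Get, while a reader that passes LatestSeq() each time sees concurrent writes as they land.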

Can I do range deletes inside a transaction?

Not through the Transaction API. Transaction offers Put, Merge, Delete, and SingleDelete but no DeleteRange, and TransactionDB::DeleteRange returns Status::NotSupported(). To clear a range transactionally, iterate and Delete each key inside the txn, or issue the range delete on a plain DB when no live txn touches the range. Range tombstones written that way still carry their usual costs: Gets in a range covered by a range tombstone do a slightly more expensive lookup, and the tombstone is reclaimed only when no snapshot can see anything it shadows, so long txns delay reclamation here too.

What does GetForUpdate do?

Reads a key and protects it for a later write. Under pessimistic mode (TransactionDB) it acquires a lock on the key, exclusive by default, the same lock a Put would take, so no other txn can modify the key between your read and your subsequent Put. Under optimistic mode it takes no lock but adds the key to the set validated at Commit. A plain Get does neither: conflict detection only covers keys you wrote or read through GetForUpdate, so if you read key A with Get, write key B based on it, and another txn changes A in between, your commit still succeeds. Use GetForUpdate when correctness depends on a key not changing between read and write.

How are deadlocks detected?

TransactionDB has a deadlock detector that maintains a wait-for graph. It is off by default and is enabled per transaction with TransactionOptions::deadlock_detect. When txn A blocks on a lock held by B, the detector adds an edge A→B. If the new edge closes a cycle (A→B→C→A), the lock request that closed it fails with Status::Busy() and that txn should roll back. The default detector depth (deadlock_detect_depth) is 50; beyond that it gives up and reports busy. With detection off, cycles are broken by the lock timeout instead. Distributed systems layered on a storage engine (CockroachDB, for example) often do their own deadlock detection across nodes; RocksDB's local detector handles single-node cases.
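The cycle check with a depth limit can be sketched as a walk along the wait-for chain. WouldDeadlock is a hypothetical function and the model assumes each txn waits on at most one other txn (exclusive locks only); a detector that supports shared locks has to follow several outgoing edges per txn.

```cpp
#include <cassert>
#include <unordered_map>

// waits_for maps a blocked txn id to the txn id it is waiting on.
// Returns true if letting `waiter` wait on `holder` would close a cycle,
// or if the chain is longer than max_depth (conservatively reported as a
// deadlock, mirroring the "gives up and reports busy" behavior).
bool WouldDeadlock(const std::unordered_map<int, int>& waits_for,
                   int waiter, int holder, int max_depth) {
  int current = holder;
  for (int depth = 0; depth < max_depth; ++depth) {
    if (current == waiter) return true;        // chain leads back to waiter
    auto it = waits_for.find(current);
    if (it == waits_for.end()) return false;   // chain ends at a running txn
    current = it->second;
  }
  return true;  // too deep to decide: assume deadlock
}
```

With edges 2→3 and 3→1 already in the graph, txn 1 asking for a lock held by txn 2 closes the cycle 1→2→3→1 and is rejected; the same request against a chain that ends at a running txn is allowed to wait.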