RocksDB Transactions
RocksDB ships two transaction implementations: TransactionDB (pessimistic, with
lock acquisition on writes and explicit GetForUpdate for reads-that-precede-writes)
and OptimisticTransactionDB (no locks; conflicts detected at commit by checking
sequence numbers). Pessimistic mode further splits into WriteCommitted (buffers
until commit, default) and WritePrepared (writes to the WAL and memtable at Prepare,
before commit). All flavors integrate with snapshots, range deletes, and column families. This page
explains the protocols, the tradeoffs, and the practical guidance for picking among them.
Transaction Mode Comparison
Key Numbers
Why Multiple Transaction Modes
TransactionDB (Pessimistic, WriteCommitted)
The default. Each Put takes a lock on its key, so conflicts surface at write time and Commit needs no separate validation pass.
TransactionDB* tdb;
TransactionDBOptions txn_opts;
TransactionDB::Open(db_opts, txn_opts, "/path", &tdb);
Transaction* txn = tdb->BeginTransaction(WriteOptions());
std::string v; // receives the values read below
txn->Put("key1", "v1"); // acquires lock on key1
txn->Get(read_opts, "key2", &v); // no lock (plain read)
txn->GetForUpdate(read_opts, "key3", &v); // acquires lock on key3
txn->Put("key3", "v3");
Status s = txn->Commit();
delete txn;
Each Put acquires an exclusive lock on its key; the lock is held until commit or
rollback. If another transaction holds the lock, the calling txn waits up to
lock_timeout (default 1000 ms) and then fails with Status::TimedOut(). With deadlock
detection enabled (TransactionOptions::deadlock_detect, off by default), a wait that
would create a deadlock fails with Status::Busy() instead of blocking.
GetForUpdate is the read-with-intent-to-write primitive. It acquires the same
kind of lock as Put would, preventing a race between read and write within the same txn.
Without it, you could read a value, decide to update it, and have another txn write the
same key in between; your update would then overwrite their write, the classic lost-update bug.
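The locking behavior described above can be sketched with a toy lock table. This is a minimal model, not RocksDB code: `LockTable`, `TryLock`, and `ReleaseAll` are hypothetical names, and a real lock manager would block with a timeout instead of returning immediately.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Toy model of exclusive per-key locks (hypothetical, for illustration only).
class LockTable {
 public:
  // Returns true if txn_id now owns the lock on key, or already owned it.
  bool TryLock(int txn_id, const std::string& key) {
    auto result = owner_.emplace(key, txn_id);  // no-op if the key is locked
    return result.second || result.first->second == txn_id;
  }

  // Releases every lock held by txn_id, as commit or rollback would.
  void ReleaseAll(int txn_id) {
    for (auto it = owner_.begin(); it != owner_.end();) {
      if (it->second == txn_id) {
        it = owner_.erase(it);
      } else {
        ++it;
      }
    }
  }

 private:
  std::unordered_map<std::string, int> owner_;  // key -> owning txn id
};

// Txn 1 locks "counter" when it reads (the GetForUpdate role); txn 2's
// competing write is refused until txn 1 finishes.
inline void DemoLostUpdatePrevented() {
  LockTable locks;
  bool t1_locked = locks.TryLock(1, "counter");
  bool t2_locked = locks.TryLock(2, "counter");
  locks.ReleaseAll(1);
  bool t2_retry = locks.TryLock(2, "counter");
  assert(t1_locked && !t2_locked && t2_retry);
}
```

Because the reader takes the lock before it decides what to write, the second writer cannot slip in between the read and the update.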
Internally, all writes are buffered in the txn's WriteBatch in memory. At commit, the whole batch is appended to the WAL and applied to the memtable atomically. If the txn was large (millions of writes), this means GBs of write batch held in RAM until commit.
OptimisticTransactionDB
No locks. Conflict detection at commit by snapshot comparison.
OptimisticTransactionDB* odb;
OptimisticTransactionDB::Open(db_opts, "/path", &odb);
OptimisticTransactionOptions otxn_opts;
otxn_opts.set_snapshot = true;
Transaction* txn = odb->BeginTransaction(WriteOptions(), otxn_opts);
std::string v;
read_opts.snapshot = txn->GetSnapshot(); // read at the txn's snapshot
txn->Put("key1", "v1"); // no lock, just buffered
txn->Get(read_opts, "key2", &v); // reads at txn start snapshot
Status s = txn->Commit();
if (s.IsBusy()) {
// conflict detected at commit; retry the txn
}
OCC works by snapshotting at txn start, recording all keys the txn writes, and at commit time checking: has any of those keys been modified since my snapshot? If yes, abort. If no, commit by appending to WAL.
Since there are no locks, multiple txns can race; only the first to commit succeeds, the others retry. Throughput is excellent under low contention (no lock overhead at all). Under high contention, the same txn may retry many times — sometimes worse than pessimistic. Tune by application: detect when retry rate is high and switch to pessimistic.
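The commit-time check can be sketched as follows. This is a toy model under simplifying assumptions, not RocksDB's implementation: `ToyStore`, `ToyTxn`, `Begin`, and `Commit` are hypothetical names, and the whole batch is given a single sequence number.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Toy optimistic store (hypothetical, for illustration only).
struct ToyStore {
  uint64_t latest_seq = 0;
  std::unordered_map<std::string, uint64_t> last_write_seq;  // key -> seq
};

struct ToyTxn {
  uint64_t snapshot_seq;                      // sequence at txn start
  std::unordered_set<std::string> write_set;  // keys this txn wants to write
};

inline ToyTxn Begin(const ToyStore& store) {
  return ToyTxn{store.latest_seq, {}};
}

// Returns true on commit, false on conflict (the caller should retry).
inline bool Commit(ToyStore& store, const ToyTxn& txn) {
  assert(txn.snapshot_seq <= store.latest_seq);
  for (const auto& key : txn.write_set) {
    auto it = store.last_write_seq.find(key);
    if (it != store.last_write_seq.end() && it->second > txn.snapshot_seq) {
      return false;  // someone wrote this key after our snapshot
    }
  }
  ++store.latest_seq;  // one sequence number for the whole batch
  for (const auto& key : txn.write_set) {
    store.last_write_seq[key] = store.latest_seq;
  }
  return true;
}
```

Two txns that start from the same snapshot and write the same key illustrate the race: the first `Commit` succeeds, the second returns false, and a retry that begins from a fresh snapshot goes through.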
WritePrepared (Pessimistic, Streaming Writes)
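Writes reach the WAL/memtable before commit; commit just marks them visible.
WriteCommitted buffers all writes in memory until commit. For a txn with 100M
writes, that's a problem. WritePrepared moves the write earlier: the batch goes to
the WAL and memtable at Prepare, tagged with the txn's prepare-sequence-number, so
Commit only has to record a commit marker. (The related WriteUnprepared policy goes
further and writes to the memtable before Prepare, which is what bounds memory for
very large txns.) Other readers encounter those entries but filter them out: only
entries from committed transactions are visible.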
WritePrepared maintains a "commit cache": a map from prepare_seq to commit_seq for recently committed txns. A reader at snapshot S sees an entry written at prepare sequence P if and only if the txn that wrote it has committed and its commit_seq ≤ S. Filtering happens during merge in the iterator/Get path; the cost is roughly 5% slower reads.
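The visibility rule can be written as a small predicate. This is a sketch only: `ToyCommitCache` and its methods are hypothetical names, and an unbounded hash map stands in for whatever bounded structure the real engine uses.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Toy commit cache (hypothetical, for illustration only).
class ToyCommitCache {
 public:
  // Records that the txn prepared at prepare_seq committed at commit_seq.
  void MarkCommitted(uint64_t prepare_seq, uint64_t commit_seq) {
    assert(commit_seq >= prepare_seq);  // a txn commits after it prepares
    commit_seq_of_[prepare_seq] = commit_seq;
  }

  // An entry tagged prepare_seq is visible to a reader at snapshot_seq only
  // if its txn has committed, and did so at or before the snapshot.
  bool IsVisible(uint64_t prepare_seq, uint64_t snapshot_seq) const {
    auto it = commit_seq_of_.find(prepare_seq);
    return it != commit_seq_of_.end() && it->second <= snapshot_seq;
  }

 private:
  std::unordered_map<uint64_t, uint64_t> commit_seq_of_;
};
```

With a txn prepared at 10 and committed at 15, a reader at snapshot 12 skips the entry even though 10 ≤ 12, because the commit happened after the snapshot was taken.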
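This suits percolator-style 2PC: the prepare phase writes intent records (which other txns can see and respect), and the commit phase atomically marks them committed. Under WriteCommitted, a prepared txn's writes reach the WAL but stay out of the memtable until Commit, so the whole batch sits in memory in the meantime. TiKV follows the same intent-then-commit pattern, though it implements it in its own transaction layer above RocksDB rather than through this API.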
2PC: Prepare and Commit
Two-phase commit primitives for distributed transactions.
txn->SetName("txn-uuid-1234"); // give it a stable identifier
txn->Put(...);
Status s = txn->Prepare(); // write prepare record to WAL
// some other coordination happens — txn is durable but not yet committed
if (other_nodes_say_yes) {
txn->Commit(); // mark committed
} else {
txn->Rollback();
}
Prepare writes a "prepare" log record durably; if the process crashes between
Prepare and Commit, on recovery the txn is left in prepared state and the application can
look it up (TransactionDB::GetTransactionByName or GetAllPreparedTransactions) and
either Commit or Rollback it. This is the building block for distributed 2PC: the
coordinator gets all participants to Prepare, then issues Commit once every one of them
has succeeded; if any participant fails to Prepare, the coordinator issues Rollback to all.
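The coordinator logic can be sketched with in-memory participants. This is a toy model, not RocksDB code: `ToyParticipant` and `RunTwoPhaseCommit` are hypothetical names, and the sketch ignores durability, timeouts, and coordinator failure.

```cpp
#include <cassert>
#include <vector>

// Toy 2PC participant (hypothetical, for illustration only).
struct ToyParticipant {
  enum class State { kActive, kPrepared, kCommitted, kRolledBack };

  bool can_prepare = true;  // simulates this participant's vote
  State state = State::kActive;

  bool Prepare() {
    if (!can_prepare) return false;
    state = State::kPrepared;
    return true;
  }
  void Commit() {
    assert(state == State::kPrepared);  // commit only after a successful prepare
    state = State::kCommitted;
  }
  void Rollback() { state = State::kRolledBack; }
};

// Phase 1: ask everyone to prepare. Phase 2: commit only if all of them
// succeeded, otherwise roll everyone back. Returns true on global commit.
inline bool RunTwoPhaseCommit(std::vector<ToyParticipant>& participants) {
  bool all_prepared = true;
  for (auto& p : participants) {
    if (!p.Prepare()) {
      all_prepared = false;
      break;
    }
  }
  for (auto& p : participants) {
    if (all_prepared) {
      p.Commit();
    } else {
      p.Rollback();
    }
  }
  return all_prepared;
}
```

A single failed vote is enough to roll back every participant, including those that had already prepared.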
Snapshots and Long Transactions
Why holding a transaction open is costly beyond memory.
A transaction's snapshot pins a sequence number. As long as the txn is alive (not committed or rolled back), compaction must retain, for every key, the newest version at or below that sequence number, tombstones included, because the txn's reads might still see it. Long-running txns therefore cause space amplification.
Worse: if a txn opened a snapshot at sequence 1000 and is still alive, every compaction across all CFs has to keep the version of each key that was current at sequence 1000, even if it has since been overwritten or deleted. Long txns × many writes × many CFs = lots of unrecyclable disk.
Best practice: keep transactions short-lived. If you need to hold state across longer
windows (a multi-minute analytical scan), do it without a transaction or accept the space
cost explicitly. Monitor rocksdb.oldest-snapshot-sequence and alert if it
lags too far behind the latest sequence.
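The retention rule can be sketched for a single key. This is a simplified model, not RocksDB's compaction logic: `VersionsToKeep` is a hypothetical helper, and it ignores details such as tombstone elision at the bottommost level.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Given the sequence numbers of all versions of ONE key and the sequence
// numbers of the live snapshots, returns the versions compaction must keep:
// the newest version, plus any older version that some snapshot still sees.
inline std::vector<uint64_t> VersionsToKeep(
    std::vector<uint64_t> versions, const std::vector<uint64_t>& snapshots) {
  std::sort(versions.begin(), versions.end());
  // Sequence numbers are unique, so no two versions may share one.
  assert(std::adjacent_find(versions.begin(), versions.end()) ==
         versions.end());
  std::vector<uint64_t> keep;
  for (std::size_t i = 0; i < versions.size(); ++i) {
    const bool is_newest = (i + 1 == versions.size());
    bool pinned = false;
    if (!is_newest) {
      // A snapshot s sees versions[i] when versions[i] <= s < versions[i + 1].
      for (uint64_t s : snapshots) {
        if (versions[i] <= s && s < versions[i + 1]) {
          pinned = true;
          break;
        }
      }
    }
    if (is_newest || pinned) keep.push_back(versions[i]);
  }
  return keep;
}
```

For versions at 900, 1500, and 2000, no snapshots means only 2000 survives. A snapshot at 1000 additionally pins 900, while 1500 can still be dropped because no snapshot falls between 1500 and 2000.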
FAQ
Pessimistic vs optimistic — which should I use?
Pessimistic (TransactionDB) acquires locks on each Put/Delete and blocks conflicting transactions. Best when contention is high and conflict resolution must be deterministic. Optimistic (OptimisticTransactionDB) skips locking entirely; conflicts are detected at Commit time by checking sequence numbers, and if a conflicting write happened, Commit fails and you retry. Best when contention is low: most txns succeed without coordination overhead. Distributed systems built on RocksDB (TiKV, CockroachDB) typically run their own transaction layer above the storage engine, so their choice of pessimistic or optimistic behavior is made at that layer rather than through these classes.
What's the difference between WriteCommitted and WritePrepared?
Both are pessimistic flavors. WriteCommitted (default) buffers writes in memory until Commit, then writes them all to the WAL and memtable atomically. Drawback: memory usage scales with the active txn write set, and all the memtable work lands on the commit path. WritePrepared writes the batch to the WAL/memtable at Prepare, with a commit marker added at Commit time; readers filter out uncommitted entries using the commit cache, which records which prepared txns have committed and at what sequence number. Cheaper commits, higher complexity. The pattern resembles the intent-then-commit flow of percolator-style 2PC.
How do snapshots interact with transactions?
A transaction can take a snapshot with txn->SetSnapshot() (or set_snapshot = true in its options) and retrieve it with txn->GetSnapshot(); this pins a sequence number. Reads that pass it as ReadOptions::snapshot see only writes ≤ that number, plus the txn's own buffered writes, so they are repeatable within the transaction. Without one, each Get sees the latest committed state, which can drift between Gets. Snapshots also block compaction from dropping the versions and tombstones they can still see, so long-held snapshots cause space amp.
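Repeatable reads at a pinned sequence number can be sketched with a toy multi-version store. This is a minimal model, not RocksDB code: `ToyMvccStore` and its methods are hypothetical names, and an empty string stands in for "not found".

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Toy multi-version store (hypothetical, for illustration only).
class ToyMvccStore {
 public:
  // Every write gets the next sequence number and never overwrites in place.
  void Put(const std::string& key, const std::string& value) {
    versions_[key][++latest_seq_] = value;
  }

  uint64_t LatestSeq() const { return latest_seq_; }

  // Returns the newest value of key with seq <= snapshot_seq, or "" if none.
  std::string Get(const std::string& key, uint64_t snapshot_seq) const {
    assert(snapshot_seq <= latest_seq_);
    auto kit = versions_.find(key);
    if (kit == versions_.end()) return "";
    auto vit = kit->second.upper_bound(snapshot_seq);  // first seq > snapshot
    if (vit == kit->second.begin()) return "";
    return std::prev(vit)->second;
  }

 private:
  uint64_t latest_seq_ = 0;
  std::map<std::string, std::map<uint64_t, std::string>> versions_;
};
```

A reader that keeps using the same `snapshot_seq` gets the same answer on every `Get`, no matter how many later `Put` calls land. A reader that passes `LatestSeq()` each time sees the drift described above.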
Can I do range deletes inside a transaction?
Generally no. The Transaction interface exposes Put, Merge, Delete, and SingleDelete but no DeleteRange, and the WriteBatchWithIndex that buffers a transaction's writes returns NotSupported for DeleteRange; check the headers of your RocksDB version before relying on either behavior. Range deletes issued outside a transaction are tombstones in RocksDB and have their own performance implications during read (Gets in a range covered by a range tombstone do a slightly more expensive lookup). Range tombstones are reclaimed only when no snapshot can see anything they shadow, so long txns prevent reclamation here too.
What does GetForUpdate do?
Reads a key and, under pessimistic mode (TransactionDB), locks it with the same exclusive lock a Put would take. Used to read-then-write within a transaction without races: the lock prevents another txn from modifying the key between your read and your subsequent Put. A plain Get takes no lock and is not tracked, so another txn can change the key after you read it and conflict detection will not notice, because it covers only the keys you wrote or read with GetForUpdate. Under OptimisticTransactionDB, GetForUpdate takes no lock but adds the key to the set validated at Commit. Use GetForUpdate when correctness depends on a key not changing between read and write.
How are deadlocks detected?
TransactionDB has a deadlock detector that maintains a wait-for graph; it is off by default and is enabled per transaction with TransactionOptions::deadlock_detect. When txn A blocks on a lock held by B, the detector adds an edge A→B. If the new edge would close a cycle (A→B→C→A), the lock request that closes it fails with Status::Busy(), and that txn should roll back and retry. The default detector depth is 50; beyond that it gives up and reports busy. For systems on top of RocksDB (CockroachDB), the higher-level transaction manager often does its own deadlock detection across nodes; RocksDB's local detector handles single-node cases.
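The cycle check can be sketched with a toy wait-for graph. This is a simplified model, not RocksDB's detector: `ToyDeadlockDetector` is a hypothetical name, and it assumes each txn waits on at most one other txn at a time.

```cpp
#include <cassert>
#include <unordered_map>

// Toy wait-for graph (hypothetical, for illustration only).
class ToyDeadlockDetector {
 public:
  explicit ToyDeadlockDetector(int max_depth) : max_depth_(max_depth) {
    assert(max_depth > 0);
  }

  // Called when `waiter` is about to block on a lock held by `holder`.
  // Returns false if the new edge would close a cycle, or if the chain is
  // longer than max_depth (the detector gives up and reports busy).
  // Otherwise records the edge waiter -> holder and returns true.
  bool TryWait(int waiter, int holder) {
    int current = holder;
    for (int depth = 0; depth < max_depth_; ++depth) {
      if (current == waiter) return false;  // walked back to ourselves: cycle
      auto it = waits_for_.find(current);
      if (it == waits_for_.end()) {
        waits_for_[waiter] = holder;  // chain ends; waiting is safe
        return true;
      }
      current = it->second;
    }
    return false;  // depth limit reached
  }

  // Called when the waiter acquires its lock or gives up.
  void StopWaiting(int waiter) { waits_for_.erase(waiter); }

 private:
  int max_depth_;
  std::unordered_map<int, int> waits_for_;  // waiter -> txn it waits on
};
```

With edges 1→2 and 2→3 recorded, a request for 3 to wait on 1 is refused because following the chain from 1 leads back to 3.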