TiDB Internals
TiDB is a distributed SQL database that speaks the MySQL wire protocol on one end and stores data in a Raft-replicated, range-sharded key-value store (TiKV) on the other. What makes it architecturally interesting is the strict separation of concerns: a stateless SQL layer (tidb-server), a row-oriented storage layer (TiKV, written in Rust on top of RocksDB), a columnar replica (TiFlash) that lives in the same Raft group, and a coordinator (PD) that tracks every 96 MB Region in the cluster. The whole stack implements Google's Percolator transaction model: distributed 2PC with timestamp ordering and a single primary lock per transaction.
Architecture Overview
The Three Tiers
A TiDB deployment has four component types (three core tiers plus the optional TiFlash replica), each scaled independently:
| Component | Language | State | Scales by |
|---|---|---|---|
| tidb-server | Go | Stateless (schema cache) | Adding nodes behind LB |
| tikv-server | Rust | Persistent (data + Raft log) | Adding nodes; PD rebalances Regions |
| pd-server | Go | Persistent metadata (small) | Fixed 3 or 5 (Raft quorum) |
| tiflash | C++ (ClickHouse fork) | Columnar replica (Raft learner) | Per-table replica count |
The split lets you run, say, 30 tidb-servers (CPU-bound query execution), 6 TiKV stores with NVMe (I/O-bound storage), 3 PD nodes (tiny, metadata only), and 4 TiFlash nodes (analytics workloads). Compare that to CockroachDB, where every node runs SQL, storage, and gossip together: simpler to deploy but harder to scale heterogeneously.
TiKV: Range-Sharded RocksDB
Every TiKV store runs two RocksDB instances per data directory: kv holds the user data, raft holds the Raft logs. The global key space is partitioned into Regions, each a contiguous [start_key, end_key) range with a default target size of 96 MB and a split threshold of 144 MB.
// PD's metadata for one Region of a table's row keyspace
Region {
    id: 12_345,
    start_key: "t\x80\x00\x00\x00\x00\x00\x00\x07_r\x80\x00\x00\x00\x00\x00\xa1\x00",
    end_key:   "t\x80\x00\x00\x00\x00\x00\x00\x07_r\x80\x00\x00\x00\x00\x01\xa1\x00",
    region_epoch: { conf_ver: 3, version: 22 },
    peers: [
        { id: 12_346, store_id: 1, role: Voter },   // Raft leader (probably)
        { id: 12_347, store_id: 2, role: Voter },
        { id: 12_348, store_id: 3, role: Voter },
        { id: 12_349, store_id: 7, role: Learner }  // TiFlash learner
    ]
}
The t\x80... prefix encodes table ID 7. The _r separator distinguishes data rows from indexes (which use _i). Inside the row prefix, the row's primary key is encoded in TiDB's memcomparable format, so RocksDB's bytewise-sorted SST files yield correct SQL ordering. Every Region runs an independent Raft group; region_epoch.version is bumped on every split or merge so stale routes are rejected by TiKV.
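To make the layout concrete, here is a small Rust sketch of building a row key. It is a simplified stand-in for TiDB's memcomparable integer codec (the real codec package handles many more types); the sign-bit flip is what lets signed integers sort correctly as raw bytes.

// Simplified sketch of a TiKV row key: "t" + table_id + "_r" + handle
fn encode_int(v: i64) -> [u8; 8] {
    // flip the sign bit so signed ints sort correctly when compared bytewise
    ((v as u64) ^ 0x8000_0000_0000_0000).to_be_bytes()
}

fn row_key(table_id: i64, handle: i64) -> Vec<u8> {
    let mut k = Vec::with_capacity(19); // 1 + 8 + 2 + 8 bytes
    k.push(b't');
    k.extend_from_slice(&encode_int(table_id));
    k.extend_from_slice(b"_r");
    k.extend_from_slice(&encode_int(handle));
    k
}

fn main() {
    // table 7, handle 161 -> "t" \x80..\x07 "_r" \x80..\xa1
    println!("{:02x?}", row_key(7, 161));
}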
The Placement Driver (PD)
PD is the cluster's brain. A 3- or 5-node etcd-backed Raft group, it holds every Region's location, allocates monotonically increasing timestamps (the TSO, fetched per transaction), and runs the scheduler that moves Region replicas around to balance load.
- TSO allocator. Hands out (physical_ms, logical_counter) pairs. tidb-servers batch up to 10 000 TSO requests per RPC. Latency: ~1 ms in-DC, and it dominates short transactions. (A sketch after this list shows how the pair packs into one 64-bit timestamp.)
- Region heartbeats. Every TiKV leader heartbeats its Region to PD every 60s with size, key count, read/write QPS. PD uses these to decide moves.
- Schedulers. balance-region-scheduler evens out storage; balance-leader-scheduler evens out leadership; hot-region-scheduler reacts to write hot spots; evict-leader-scheduler drains a node before maintenance.
- Placement rules. Constrain replicas by labels: e.g., "3 voters spread across 3 zones, 1 learner in zone D for TiFlash."
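As promised above, a small Rust sketch of how the (physical_ms, logical_counter) pair packs into the single 64-bit timestamp a transaction carries around, assuming an 18-bit logical suffix; treat the constant as illustrative rather than a stable API.

// A TSO timestamp is physical milliseconds shifted left, plus a logical counter
// that disambiguates timestamps handed out within the same millisecond.
const LOGICAL_BITS: u32 = 18; // assumed width of the logical counter

fn compose_ts(physical_ms: u64, logical: u64) -> u64 {
    (physical_ms << LOGICAL_BITS) | logical
}

fn physical_ms(ts: u64) -> u64 {
    ts >> LOGICAL_BITS
}

fn main() {
    // 2025-01-01T00:00:00Z in ms, the 42nd timestamp issued in that millisecond
    let ts = compose_ts(1_735_689_600_000, 42);
    assert_eq!(physical_ms(ts), 1_735_689_600_000);
    println!("tso = {ts}");
}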
Percolator Transactions
TiDB transactions follow the Google Percolator paper. Snapshot reads are easy: each transaction takes a start_ts from PD and reads the most recent committed version <= start_ts for every key. Writes are a two-phase protocol with a "primary" key that linearizes the commit.
// Pseudo-code of a Percolator commit; start_ts was taken from PD when the txn began
fn commit(start_ts: u64, write_set: Vec<(Key, Value)>) -> Result<()> {
    let primary = write_set[0].0.clone(); // pick the first key as the primary
    // PHASE 1: prewrite every key
    for (key, value) in &write_set {
        tikv.prewrite(key, value, primary.clone(), start_ts)?;
        // writes a Lock record at (key, start_ts) pointing to primary
        // FAILS if any other lock or any committed write > start_ts exists
    }
    // Get commit_ts AFTER all prewrites: every key is locked by now, so the
    // commit point orders after any read that saw the pre-lock versions
    let commit_ts = pd.get_tso();
    // PHASE 2: commit primary first, then secondaries
    tikv.commit(primary.clone(), start_ts, commit_ts)?; // the durable commit point
    for (key, _) in &write_set[1..] {
        tikv.commit(key.clone(), start_ts, commit_ts)?; // best-effort; readers can roll forward
    }
    Ok(())
}
The clever bit. If the coordinator crashes after committing the primary but before committing all secondaries, the transaction is still semantically committed: any future reader who encounters a stale lock on a secondary will look up the primary's commit record and roll the secondary forward (or back, if the primary was rolled back). This means there is no central commit log, and the system recovers from a crashed coordinator without any extra machinery.
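A minimal Rust sketch of that roll-forward/roll-back decision, using hypothetical stand-ins for TiKV's MVCC lookups (not the real TiKV client API):

// How a reader resolves a stale lock it finds on a secondary key.
enum Resolution { RollForward { commit_ts: u64 }, RollBack }

// Stand-in: find the commit ("write") record, if any, of the transaction
// that started at `start_ts`, by checking the primary key's MVCC history.
fn primary_commit_ts(_primary: &[u8], _start_ts: u64) -> Option<u64> { Some(421) }

// Stand-in: clean up the primary lock once we decide the txn must roll back
// (in reality only after the lock's TTL has expired).
fn rollback_primary(_primary: &[u8], _start_ts: u64) {}

fn resolve_secondary(primary: &[u8], lock_start_ts: u64) -> Resolution {
    match primary_commit_ts(primary, lock_start_ts) {
        // Primary committed => the whole transaction committed: turn the
        // secondary's lock into a write record at the same commit_ts.
        Some(commit_ts) => Resolution::RollForward { commit_ts },
        // No commit record on the primary => the transaction never committed:
        // roll back the primary, then this secondary.
        None => {
            rollback_primary(primary, lock_start_ts);
            Resolution::RollBack
        }
    }
}

fn main() {
    match resolve_secondary(b"order_row_17", 417) {
        Resolution::RollForward { commit_ts } => println!("roll forward @ {commit_ts}"),
        Resolution::RollBack => println!("roll back"),
    }
}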
The cost. Each key in the write set is written at least twice (once at prewrite, once at commit), and every write is Raft-replicated to three stores. A 100-key transaction across 5 Regions therefore turns into roughly 200 key writes before the client sees success. TiDB has an "async commit" mode (since 5.0) that returns to the client once prewrite completes (skipping the commit_ts round trip), and a "1PC" fast path for transactions whose entire write set fits in one Region.
Coprocessor Pushdown
Naively, tidb-server would fetch every row from TiKV and filter/aggregate in Go. That would be unworkable: the network would saturate. Instead, TiDB pushes execution fragments down into TiKV via the coprocessor interface: predicates, projections, simple aggregates, top-N, and limits execute inside the TiKV process.
EXPLAIN ANALYZE SELECT category, COUNT(*)
FROM orders
WHERE created_at >= '2025-01-01' AND status = 'shipped'
GROUP BY category;
HashAgg (final agg in tidb-server)
└─ TableReader (fetches partial aggregates)
   └─ HashAgg          ← pushed down: runs in the TiKV coprocessor
      └─ Selection     ← pushed down: predicate evaluated in TiKV
         └─ TableFullScan
With pushdown, only the per-Region partial aggregates cross the network. For TiFlash, the entire MPP plan can run in the columnar layer, with shuffle exchanges between TiFlash nodes and only the final result returned to tidb-server. This is what makes HTAP queries on TiDB competitive with dedicated OLAP systems on the same hardware.
TiFlash: The Columnar Replica
A standard TiKV Region has 3 voting Raft peers. Add a learner and you get a 4th replica that participates in Raft log replication but doesn't vote. TiFlash is exactly that: a learner whose state machine is a columnar storage engine derived from ClickHouse.
-- Add 2 TiFlash replicas for the orders table
ALTER TABLE orders SET TIFLASH REPLICA 2;
-- Check status
SELECT TABLE_SCHEMA, TABLE_NAME, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica;
-- Force the optimizer to use TiFlash
SELECT /*+ READ_FROM_STORAGE(TIFLASH[orders]) */
category, SUM(total)
FROM orders
GROUP BY category;
Because TiFlash is in the Raft group, reads are snapshot-consistent with TiKV under the same Percolator timestamp. The optimizer maintains a cost model that knows how much each engine charges for a scan (TiFlash cheap, TiKV expensive for full scans) versus a point lookup (TiKV cheap, TiFlash expensive). It chooses per table access, so a single query can mix engines.
Online Schema Changes
TiDB implements the F1 paper's online, asynchronous schema-change algorithm. A DDL moves through the schema states none → delete-only → write-only → write-reorganization → public, and at any point in time at most two adjacent states are active across the cluster. tidb-servers poll the schema version and lazily load the new schema. (A small state-machine sketch follows the list below.)
- Add column. Non-blocking. Existing rows report NULL or the declared default without a physical backfill; the value is filled in at read time.
- Add index. Non-blocking. Backfill scans the table in batches; concurrent writes update both old and new state per F1.
- Drop column. Goes through delete-only first, then physical removal happens in the GC pass.
- Lossy data type changes (e.g., VARCHAR(10) → VARCHAR(5)) require an offline migration; TiDB is honest about it and rejects them.
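Here is the tiny state-machine sketch promised above, in Rust. The variant Absent stands in for the paper's none state (renamed only to avoid clashing with Rust's Option::None), and the transition function is purely illustrative; the real DDL owner persists the job and its state durably rather than in memory.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum SchemaState { Absent, DeleteOnly, WriteOnly, WriteReorg, Public }

// One step up the ladder; the cluster never contains more than two adjacent
// states at once, which is what keeps the change online.
fn next_state(s: SchemaState) -> Option<SchemaState> {
    match s {
        SchemaState::Absent => Some(SchemaState::DeleteOnly),
        SchemaState::DeleteOnly => Some(SchemaState::WriteOnly),
        SchemaState::WriteOnly => Some(SchemaState::WriteReorg), // index backfill runs here
        SchemaState::WriteReorg => Some(SchemaState::Public),
        SchemaState::Public => None, // terminal: every tidb-server sees the new schema
    }
}

fn main() {
    let mut s = SchemaState::Absent;
    while let Some(n) = next_state(s) {
        println!("{:?} -> {:?}", s, n);
        s = n;
    }
}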
Tradeoffs & When To Use
TiDB vs CockroachDB
| | TiDB | CockroachDB |
|---|---|---|
| Wire protocol | MySQL | PostgreSQL |
| Components | Separate SQL/storage/PD/TiFlash binaries | Single cockroach binary, all roles per node |
| Storage | RocksDB (Rust wrapper) | Pebble (Go reimpl of RocksDB) |
| Transaction model | Percolator (primary lock) | Parallel commits + HLC |
| Default isolation | Snapshot (named "RR") | Serializable |
| Timestamp source | PD (centralized TSO) | HLC (per-node, gossiped) |
| Columnar engine | TiFlash (Raft learner) | None |
| Region/Range size | 96 MB | 512 MB (default) |
FAQ
Is TiDB a fork of MySQL?
No. TiDB reimplements the MySQL wire protocol and a substantial subset of the MySQL grammar from scratch in Go. The storage layer (TiKV) is a separate Rust project built on RocksDB. The wire compatibility means most MySQL drivers, ORMs, and admin tools work unchanged, but the optimizer, executor, statistics format, and DDL behavior are all different. Some MySQL features are unsupported or behave differently: foreign keys were parsed but not enforced before v6.6, stored procedures don't exist, and certain isolation levels are remapped (READ COMMITTED behaves like Oracle's read committed, not MySQL's).
How does TiDB compare to CockroachDB?
Both are NewSQL: distributed, MVCC, range-sharded, Raft-replicated, with online schema changes following Google's F1 paper. The key differences: TiDB separates SQL and storage into distinct binaries (tidb-server is stateless SQL, TiKV holds data), CockroachDB co-locates them. TiDB speaks the MySQL wire protocol, Cockroach speaks PostgreSQL. TiDB's transaction model is Percolator (timestamp ordering with a primary lock), Cockroach uses parallel commits over MVCC + HLC. TiDB ships with an explicit columnar engine (TiFlash) for HTAP; Cockroach has no columnar story. Cockroach defaults to serializable; TiDB defaults to snapshot isolation (called 'repeatable read' for MySQL compatibility, but it really is snapshot).
What is a Region in TiKV?
A Region is the unit of distribution and replication in TiKV. Every Region holds a contiguous range of the global key space, defaults to 96 MB / 960k keys (split threshold: 144 MB), and is replicated to 3 (default) or 5 TiKV stores using Raft. The Placement Driver (PD) tracks every Region's metadata (start/end key, leader store, peer stores) and rebalances by moving replicas. When a Region grows beyond the threshold it auto-splits at the median key. When two adjacent Regions are both small, they may merge. There can be millions of Regions in a large cluster.
How do Percolator-based transactions actually work?
TiDB encodes a transaction as a write set of (key, value) pairs. On commit it picks one primary key (heuristically the first in the set), then in a prewrite phase it Raft-replicates a lock record on every key (locks reference back to the primary). If all prewrites succeed, the commit phase first commits the primary by writing a 'write' record at the commit timestamp; once that record is durable the transaction is logically committed even if secondary commits crash mid-flight. Secondaries are then asynchronously committed (or rolled forward by future readers who follow the primary lock). The protocol gives snapshot-isolated reads with no central commit log, at the cost of writing every key at least twice.
Why does TiDB have a separate stateless SQL layer?
Decoupling SQL from storage means SQL nodes scale independently from data. You can run 30 tidb-servers behind a load balancer, each holding only the cached schema and ongoing transaction state; restart one and you lose nothing durable. It also makes adding TiFlash (the columnar replica) clean: tidb-server picks at planning time whether to read each table from TiKV (row, OLTP) or TiFlash (column, OLAP). The cost is a network hop on every key access; TiDB invests heavily in coprocessor pushdown to amortize it.
What is TiFlash and how does HTAP actually work?
TiFlash is a columnar storage engine (modeled on ClickHouse) that runs as an additional Raft learner for selected tables. The TiKV leader replicates Raft logs to TiFlash learners, which decode and replay them into TiFlash's columnar storage format. Because TiFlash is in the Raft group, its data is consistent with TiKV: reads against TiFlash use the same Percolator timestamp and see the same snapshot. The TiDB optimizer cost-models both engines and routes scan-heavy queries to TiFlash. The downside: TiFlash adds another full copy of every replicated table and puts extra CPU and replication load on TiKV leaders.
When does TiDB struggle?
Hot shard / hot Region scenarios. A monotonically increasing primary key (AUTO_INCREMENT, timestamp-prefixed UUID) makes every write hit the same Region until it splits, and PD takes 10-30 s to detect and rebalance. The fix is AUTO_RANDOM or hash-prefixed keys. The other pain is small-transaction throughput: every commit pays at minimum a PD round trip for the commit timestamp plus two phases of Raft-replicated writes. For pure single-row OLTP at 50k+ TPS, vanilla MySQL with read replicas often outperforms TiDB until you're past the size where MySQL would need sharding anyway.
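As a toy Rust illustration of the hash-prefix idea (not TiDB's actual AUTO_RANDOM bit layout), the sketch below scatters a monotonically increasing id across a fixed number of buckets so consecutive inserts land in different Regions:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Prefix the key with a 1-byte bucket derived from a hash of the id.
// Consecutive ids get different prefixes, so writes spread across Regions;
// within a bucket the original ordering is preserved.
fn hash_prefixed_key(id: u64, buckets: u64) -> Vec<u8> {
    let mut h = DefaultHasher::new();
    id.hash(&mut h);
    let bucket = (h.finish() % buckets) as u8; // buckets must be <= 256 for a 1-byte prefix
    let mut key = vec![bucket];
    key.extend_from_slice(&id.to_be_bytes());
    key
}

fn main() {
    for id in 1_000_000u64..1_000_004 {
        println!("{:?}", hash_prefixed_key(id, 16));
    }
}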