TiDB Internals

TiDB is a distributed SQL database that speaks the MySQL wire protocol on one end and stores data in a Raft-replicated, range-sharded key-value store (TiKV) on the other. What makes it architecturally interesting is the strict separation of concerns: a stateless SQL layer (tidb-server), a row-oriented storage layer (TiKV, written in Rust on top of RocksDB), a columnar replica (TiFlash) that lives in the same Raft group, and a coordinator (PD) that tracks every 96 MB Region in the cluster. The whole stack implements Google's Percolator transaction model — distributed 2PC with timestamp ordering and a single primary lock per transaction.

Architecture Overview

[Architecture diagram: a MySQL driver/app connects to a stateless SQL tier of tidb-servers (parser, planner, executor; no shared state; horizontally scaled), which talk to the PD cluster (TSO, region map) and to a storage tier of Raft-replicated Regions. Each TiKV store (Rust) hosts Region leaders (L) and followers (F) over two RocksDB instances, kv for data and raft for the log, with 96 MB Regions and a coprocessor; PD rebalances replicas, 3-replica default. A TiFlash node holds read-only columnar Raft learners (PageStorage) for snapshot-consistent reads.]

Key Numbers

| Metric | Value |
|---|---|
| Region size | 96 MB |
| Split threshold | 144 MB |
| Default replicas | 3 |
| Storage engine | RocksDB |
| Wire protocol | MySQL 5.7+ |
| Transaction model | Percolator |
| Default isolation | Snapshot |

Why TiDB Exists

The Gap
Operating MySQL past a few terabytes meant manual sharding (Vitess, ProxySQL, custom code), losing cross-shard transactions and sacrificing operational simplicity. Aurora and Cloud SQL only scaled reads, not writes. The Spanner paper showed it was possible to do better, but Google didn't ship it externally for years.
The Insight
Treat the SQL layer as stateless query compilation and put a transactional KV store underneath. If the KV store is range-sharded with auto-split + Raft, the SQL layer stays simple β€” it just translates MySQL semantics into KV operations and a Percolator transaction. Add a columnar Raft learner (TiFlash) and you also get analytics on the same cluster.
The Result
Drop-in MySQL replacement that scales to petabytes without sharding ceremony, with HTAP queries that route to row or column storage automatically. The cost is real — every write is a 2PC across nodes — but for workloads that already needed sharding, it's a strict improvement.

The Three Tiers

A TiDB deployment has exactly three component types, each scaled independently:

| Component | Language | State | Scales by |
|---|---|---|---|
| tidb-server | Go | Stateless (schema cache) | Adding nodes behind LB |
| tikv-server | Rust | Persistent (data + Raft log) | Adding nodes; PD rebalances Regions |
| pd-server | Go | Persistent metadata (small) | Fixed 3 or 5 (Raft quorum) |
| tiflash | C++ (ClickHouse fork) | Columnar replica (Raft learner) | Per-table replica count |

The split lets you run, say, 30 tidb-servers (CPU-bound query execution), 6 TiKV stores with NVMe (I/O bound storage), 3 PD nodes (tiny, only metadata), and 4 TiFlash nodes (analytics workloads). Compare to CockroachDB where every node runs SQL + storage + gossip together — simpler to deploy but harder to scale heterogeneously.

TiKV: Range-Sharded RocksDB

Every TiKV store runs two RocksDB instances per data directory: kv holds the user data, raft holds Raft logs. The global key space is partitioned into Regions, each a contiguous [start_key, end_key) range with a default target of 96 MB and a split threshold of 144 MB.

// PD's region metadata for a table row
Region {
  id: 12_345,
  start_key: t\x80\x00\x00\x00\x00\x00\x00\x07_r\x80\x00\x00\x00\x00\x00\xa1\x00,
  end_key:   t\x80\x00\x00\x00\x00\x00\x00\x07_r\x80\x00\x00\x00\x00\x01\xa1\x00,
  region_epoch: { conf_ver: 3, version: 22 },
  peers: [
    { id: 12_346, store_id: 1, role: Voter },     // Raft leader (probably)
    { id: 12_347, store_id: 2, role: Voter },
    { id: 12_348, store_id: 3, role: Voter },
    { id: 12_349, store_id: 7, role: Learner }    // TiFlash learner
  ]
}

The t\x80... prefix encodes table ID 7. The _r separator distinguishes data rows from indexes (which use _i). Inside the row prefix, the row's primary key is encoded in TiDB's memcomparable format so RocksDB's bytewise-sorted SST files yield correct SQL ordering. Every Region runs an independent Raft group; region_epoch.version bumps on every split/merge so stale routes are rejected by TiKV.

The Placement Driver (PD)

PD is the cluster's brain. A 3- or 5-node etcd-backed Raft group, it holds every Region's location, allocates monotonically increasing timestamps (the TSO, fetched per transaction), and runs the scheduler that moves Region replicas around to balance load.
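The TSO itself is a single 64-bit value: a wall-clock millisecond in the high bits and an 18-bit logical counter in the low bits, so PD can hand out up to 2^18 = 262,144 distinct timestamps per millisecond. A sketch of the packing (helper names are mine; real clients batch TSO requests over gRPC):

```go
package main

import "fmt"

// PD-style TSO layout: physical millisecond << 18 | logical counter.
// Later TSOs always compare greater as plain uint64s.
const logicalBits = 18

func composeTS(physicalMs, logical int64) uint64 {
	return uint64(physicalMs)<<logicalBits | uint64(logical)
}

func extractPhysical(ts uint64) int64 { return int64(ts >> logicalBits) }
func extractLogical(ts uint64) int64  { return int64(ts & (1<<logicalBits - 1)) }

func main() {
	ts := composeTS(1_700_000_000_000, 42)
	fmt.Println(extractPhysical(ts), extractLogical(ts)) // 1700000000000 42
}
```

Because every transaction in the cluster fetches its start_ts and commit_ts from this one counter, ordering is trivially global; the price is that PD's round-trip latency sits on every commit path.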

Percolator Transactions

TiDB transactions follow the Google Percolator paper. Snapshot reads are easy: each transaction takes a start_ts from PD and reads the most recent committed version <= start_ts for every key. Writes are a 2-phase protocol with a "primary" key that linearizes the commit.
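The read rule can be sketched in a few lines (a toy, assuming the committed versions of a key are kept newest-first; names are mine, not TiKV's API):

```go
package main

import "fmt"

// version is one committed MVCC entry for a key.
type version struct {
	commitTs uint64
	value    string
}

// snapshotGet implements the Percolator read rule: return the newest
// version whose commit_ts <= the reader's start_ts.
func snapshotGet(versions []version, startTs uint64) (string, bool) {
	for _, v := range versions {
		if v.commitTs <= startTs {
			return v.value, true
		}
	}
	return "", false // key did not exist at this snapshot
}

func main() {
	history := []version{{commitTs: 120, value: "new"}, {commitTs: 90, value: "old"}}
	v, _ := snapshotGet(history, 100)
	fmt.Println(v) // "old": this snapshot predates commit_ts 120
}
```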

// Pseudo-code of a Percolator commit
fn commit(write_set: Vec<(Key, Value)>) -> Result<()> {
  let start_ts = pd.get_tso();
  let primary    = write_set[0].0.clone();    // pick the first key

  // PHASE 1 — prewrite every key
  for (key, value) in &write_set {
    tikv.prewrite(key, value, primary.clone(), start_ts)?;
    // writes a Lock record at (key, start_ts) pointing to primary
    // FAILS if any other lock or any write > start_ts exists
  }

  // Get commit_ts AFTER all prewrites (so commit_ts > start_ts of any
  // overlapping prewrite that locked the same key first)
  let commit_ts = pd.get_tso();

  // PHASE 2 — commit primary first, then secondaries
  tikv.commit(primary.clone(), start_ts, commit_ts)?;       // durable point
  for (key, _) in &write_set[1..] {
    tikv.commit(key.clone(), start_ts, commit_ts)?;          // best-effort
  }
  Ok(())
}

The clever bit. If the coordinator crashes after committing the primary but before committing all secondaries, the transaction is still semantically committed — any future reader who encounters a stale lock on a secondary will look up the primary's commit record and roll the secondary forward (or back if the primary was rolled back). This means there is no central commit log, and the system recovers from any single failure without a coordinator.
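The roll-forward/roll-back decision a reader makes can be sketched with toy in-memory state (all names illustrative, not TiKV's actual API):

```go
package main

import "fmt"

// Toy state for one store: locks left by prewrite, and commit
// records written in phase 2.
type store struct {
	locks   map[string]lock   // key -> prewrite lock
	commits map[string]uint64 // key -> commit_ts
}

type lock struct {
	primary string
	startTs uint64
}

// resolve is what a reader does on finding a stale lock on `key`:
// consult the primary, then roll the secondary forward or back.
func (s *store) resolve(key string) {
	l, ok := s.locks[key]
	if !ok {
		return // no lock left, nothing to do
	}
	if commitTs, committed := s.commits[l.primary]; committed {
		// Primary committed, so the txn is logically committed:
		// roll the secondary forward to the same commit_ts.
		s.commits[key] = commitTs
	}
	// Either way the stale lock is cleared; if the primary never
	// committed, clearing the lock is the rollback.
	delete(s.locks, key)
}

func main() {
	s := &store{
		locks:   map[string]lock{"k2": {primary: "k1", startTs: 100}},
		commits: map[string]uint64{"k1": 105}, // primary committed at 105
	}
	s.resolve("k2")
	fmt.Println(s.commits["k2"]) // 105: secondary rolled forward
}
```

The key property: the primary's commit record is the single source of truth, so any reader can finish (or undo) a crashed coordinator's work.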

The cost. Each key in the write set is at least 2 RPCs. A 100-key transaction across 5 Regions on 3 stores is ~200 RPCs. TiDB has an "async commit" mode (since 5.0) that returns to the client after prewrite completes (skipping the commit_ts roundtrip), and a "1PC" fast path for transactions whose entire write set fits in one Region.

Coprocessor Pushdown

Naively, tidb-server would fetch every row from TiKV and filter/aggregate in Go. That would be unworkable — the network would saturate. Instead, TiDB pushes execution fragments down into TiKV via the coprocessor interface: predicates, projections, simple aggregates, top-N, and limits execute inside the TiKV process.

EXPLAIN ANALYZE SELECT category, COUNT(*)
FROM orders
WHERE created_at >= '2025-01-01' AND status = 'shipped'
GROUP BY category;

  HashAgg            (final agg in tidb-server)
  └─ TableReader     (fetches partial aggregates)
     └─ HashAgg      ◀ pushed down — runs in TiKV coprocessor
        └─ Selection ◀ pushed down — predicate evaluated in TiKV
           └─ TableFullScan

With pushdown, only the per-Region partial aggregates cross the network. For TiFlash, the entire MPP plan can run in the columnar layer with shuffle exchanges between TiFlash nodes, returning a final result to tidb-server. This is what makes HTAP queries on TiDB competitive with dedicated OLAP systems on the same hardware.
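The final-aggregation step in the plan above reduces to merging per-Region partials. An illustrative sketch (in reality the partials arrive as protobuf chunks, not Go maps):

```go
package main

import "fmt"

// mergePartials is the final HashAgg: each Region's coprocessor
// returned a per-group partial COUNT; tidb-server sums them into
// the final result.
func mergePartials(partials []map[string]int64) map[string]int64 {
	final := make(map[string]int64)
	for _, p := range partials {
		for group, count := range p {
			final[group] += count
		}
	}
	return final
}

func main() {
	region1 := map[string]int64{"books": 10, "toys": 3}
	region2 := map[string]int64{"books": 7}
	fmt.Println(mergePartials([]map[string]int64{region1, region2})["books"]) // 17
}
```

Only these tiny maps cross the network, regardless of how many million rows each Region scanned.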

TiFlash: The Columnar Replica

A standard TiKV Region has 3 voting Raft peers. Add a learner and you get a 4th replica that participates in Raft log replication but doesn't vote. TiFlash is exactly that: a learner whose state machine is a columnar storage engine derived from ClickHouse.

-- Add 2 TiFlash replicas for the orders table
ALTER TABLE orders SET TIFLASH REPLICA 2;

-- Check status
SELECT TABLE_SCHEMA, TABLE_NAME, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica;

-- Force the optimizer to use TiFlash
SELECT /*+ READ_FROM_STORAGE(TIFLASH[orders]) */
  category, SUM(total)
FROM orders
GROUP BY category;

Because TiFlash is in the Raft group, reads are snapshot-consistent with TiKV under the same Percolator timestamp. The optimizer maintains a cost model that knows how much each engine charges for a scan (TiFlash cheap, TiKV expensive for full scans) versus a point lookup (TiKV cheap, TiFlash expensive). It picks per-table-access, so a single query can mix engines.
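The routing decision reduces to comparing cost estimates per table access. A deliberately toy sketch (the constants and function are invented for illustration; TiDB's real cost model weighs network, concurrency, indexes, and more):

```go
package main

import "fmt"

// chooseEngine sketches the per-table-access choice: estimate a cost
// for each engine and take the cheaper. Invented constants: the row
// store pays per row scanned; the column store has a fixed startup
// cost but is far cheaper per scanned row.
func chooseEngine(rowsScanned float64, pointLookup bool) string {
	if pointLookup {
		return "tikv" // row store wins point reads outright
	}
	tikvCost := rowsScanned * 1.0
	tiflashCost := 1000 + rowsScanned*0.05
	if tiflashCost < tikvCost {
		return "tiflash"
	}
	return "tikv"
}

func main() {
	fmt.Println(chooseEngine(100, false))        // small scan stays on TiKV
	fmt.Println(chooseEngine(10_000_000, false)) // big scan routes to TiFlash
}
```

Because the choice is per table access, one query joining a small dimension table against a huge fact table can read the former from TiKV and the latter from TiFlash.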

Online Schema Changes

TiDB implements the F1 paper's "online async schema change" algorithm. A DDL goes through five public states — none → delete-only → write-only → write-reorganization → public — and at every transition at most two adjacent states are active across the cluster simultaneously. tidb-servers poll the schema version and lazily reload the changed metadata.
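The "at most two adjacent states" invariant is the whole correctness argument, and it is small enough to state in code (a sketch; state names follow the description above, the check is mine):

```go
package main

import "fmt"

// The five schema states, in transition order.
type schemaState int

const (
	none schemaState = iota
	deleteOnly
	writeOnly
	writeReorg
	public
)

// compatible reports whether two tidb-servers may serve traffic
// concurrently: the F1 algorithm is only correct when every live
// server is within one transition of every other.
func compatible(a, b schemaState) bool {
	d := int(a) - int(b)
	return d >= -1 && d <= 1
}

func main() {
	fmt.Println(compatible(deleteOnly, writeOnly)) // true: adjacent states
	fmt.Println(compatible(none, writeOnly))       // false: two steps apart
}
```

This is why the DDL owner waits one schema-lease period at each transition: it guarantees no server is still two states behind before advancing.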

Tradeoffs & When To Use

Use TiDB when
You're already on MySQL and outgrowing a single primary, you need true cross-row transactions across what would have been shards, you want to add analytics queries (TiFlash) without ETL to a warehouse, and you can run at least 8 nodes (3 PD, 3 TiKV, 2+ tidb-server).
Avoid TiDB when
Your workload is small enough for vanilla MySQL (the Percolator overhead doesn't pay off). You need PostgreSQL features. You want strong serializable by default — TiDB is snapshot isolation under the MySQL "RR" name. Your dataset has hot keys you can't randomize.
Operational gotchas
Hot Regions on monotonic keys (use AUTO_RANDOM). PD is a single Raft group — its TSO latency caps your minimum txn latency. TiFlash doubles storage for replicated tables. Region heartbeat storms during cluster-wide migrations. GC must keep up — abandoned MVCC versions accumulate fast under update-heavy load.
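AUTO_RANDOM defuses the hot-Region problem by putting random shard bits in the high bits of the generated id, so consecutive inserts land in distant key ranges. A toy version of the bit layout (the helper is invented; TiDB derives the shard value differently):

```go
package main

import "fmt"

// autoRandomID sketches the AUTO_RANDOM layout: shardBits bits just
// below the sign bit carry a shard value, scattering a monotonically
// increasing sequence across 2^shardBits distinct key ranges.
func autoRandomID(shard, seq int64, shardBits uint) int64 {
	shardMask := (int64(1)<<shardBits - 1) << (63 - shardBits)
	return (shard << (63 - shardBits) & shardMask) | seq
}

func main() {
	// Identical sequence values with different shards land far apart
	// in the key space, so inserts spread over many Regions.
	fmt.Println(autoRandomID(1, 100, 5), autoRandomID(2, 100, 5))
}
```

The low bits still increase monotonically within a shard, so ids stay unique; only the key-space locality is deliberately destroyed.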

TiDB vs CockroachDB

| Aspect | TiDB | CockroachDB |
|---|---|---|
| Wire protocol | MySQL | PostgreSQL |
| Components | Separate SQL/storage/PD/TiFlash binaries | Single cockroach binary, all roles per node |
| Storage | RocksDB (Rust wrapper) | Pebble (Go reimplementation of RocksDB) |
| Transaction model | Percolator (primary lock) | Parallel commits + HLC |
| Default isolation | Snapshot (named "RR") | Serializable |
| Timestamp source | PD (centralized TSO) | HLC (per-node, gossiped) |
| Columnar engine | TiFlash (Raft learner) | None |
| Region/Range size | 96 MB | 512 MB (default) |

FAQ

Is TiDB a fork of MySQL?

No. TiDB reimplements the MySQL wire protocol and a substantial subset of the MySQL grammar from scratch in Go. The storage layer (TiKV) is a separate Rust project built on RocksDB. The wire compatibility means most MySQL drivers, ORMs, and admin tools work unchanged, but the optimizer, executor, statistics format, and DDL behavior are all different. Some MySQL features are unsupported or behave differently: foreign keys were parsed but not enforced before v6.6, stored procedures don't exist, and certain isolation levels are remapped (READ COMMITTED behaves like Oracle's read committed, not MySQL's).

How does TiDB compare to CockroachDB?

Both are NewSQL: distributed, MVCC, range-sharded, Raft-replicated, online schema changes via Google's F1 paper. The key differences: TiDB separates SQL and storage into distinct binaries (tidb-server is stateless SQL, TiKV holds data), CockroachDB co-locates them. TiDB speaks MySQL wire protocol, Cockroach speaks PostgreSQL. TiDB's transaction model is Percolator (timestamp ordering with primary lock), Cockroach uses parallel commits over MVCC + HLC. TiDB ships with an explicit columnar engine (TiFlash) for HTAP; Cockroach has no columnar story. Cockroach defaults to serializable; TiDB defaults to snapshot isolation (called 'repeatable read' for MySQL compatibility, but it really is snapshot).

What is a Region in TiKV?

A Region is the unit of distribution and replication in TiKV. Every Region holds a contiguous range of the global key space, defaults to 96 MB / 960k keys (split threshold: 144 MB), and is replicated to 3 (default) or 5 TiKV stores using Raft. The Placement Driver (PD) tracks every Region's metadata — start/end key, leader store, peer stores — and rebalances by moving replicas. When a Region grows beyond the threshold it auto-splits at the median key. When two adjacent Regions are both small, they may merge. There can be millions of Regions in a large cluster.

How do Percolator-based transactions actually work?

TiDB encodes a transaction as a write set of (key, value) pairs. On commit it picks one primary key (heuristically the first in the set), then in a prewrite phase it Raft-replicates a lock record on every key (locks reference back to the primary). If all prewrites succeed, the commit phase first commits the primary by writing a 'write' record at the commit timestamp; once that record is durable the transaction is logically committed even if secondary commits crash mid-flight. Secondaries are then asynchronously committed (or rolled forward by future readers who follow the primary lock). The protocol gives snapshot-isolated reads with no central commit log, at the cost of two RPCs per key.

Why does TiDB have a separate stateless SQL layer?

Decoupling SQL from storage means SQL nodes scale independently from data. You can run 30 tidb-servers behind a load balancer, each holding only the cached schema and ongoing transaction state — restart one and you lose nothing durable. It also makes adding TiFlash (the columnar replica) clean: tidb-server picks at planning time whether to read each table from TiKV (row, OLTP) or TiFlash (column, OLAP). The cost is a network hop on every key access — TiDB invests heavily in coprocessor pushdown to amortize it.

What is TiFlash and how does HTAP actually work?

TiFlash is a columnar storage engine (modeled on ClickHouse) that runs as an additional Raft learner on selected tables. The TiKV leader replicates Raft logs to TiFlash learners, which decode and replay them into a columnar PageStorage format. Because TiFlash is in the Raft group, its data is consistent with TiKV — reads against TiFlash use the same Percolator timestamp and see the same snapshot. The TiDB optimizer cost-models both engines and routes scan-heavy queries to TiFlash. The downside: TiFlash doubles your storage footprint for replicated tables and adds CPU pressure on TiKV leaders.

When does TiDB struggle?

Hot shard / hot Region scenarios. A monotonically increasing primary key (AUTO_INCREMENT, timestamp-prefixed UUID) makes every write hit the same Region until it splits — and PD takes 10-30 s to detect and rebalance. The fix is AUTO_RANDOM or hash-prefixed keys. The other pain is small-transaction throughput: every commit pays round-trips to PD for its timestamps plus prewrite and commit RPCs to TiKV. For pure single-row OLTP at 50k+ TPS, vanilla MySQL with read replicas often outperforms TiDB until you're past the size where MySQL would need sharding anyway.