Vitess Internals

Vitess is the database layer that lets a single MySQL deployment scale to thousands of nodes without sacrificing the operational maturity of MySQL itself. Born at YouTube in 2010 and now a graduated CNCF project, it sits in front of unmodified MySQL servers as a routing, management, and replication tier — turning N MySQL primaries into one logical database. Slack, GitHub, HubSpot, and PlanetScale all run on it.

Based on the Vitess source, the official Vitess documentation, and Sugu Sougoumarane's published talks on the YouTube origin story.

Vitess Architecture

Key Numbers

Origin

YouTube 2010

Largest Cluster

10K+ tablets

CNCF Status

Graduated 2019

Backend

MySQL 5.7+/8.0

Routing Layer

VTGate

Per-Shard Agent

VTTablet

Replication Lag

~sub-second

Why Vitess Exists

The Gap

By 2010 YouTube's MySQL primaries were saturated. Vertical scaling was over: bigger boxes did not exist. Application-side sharding worked for writes but reinvented routing, failover, and migrations in every service. Switching to a NoSQL store would lose joins, transactions, and a decade of MySQL operator knowledge.

The Insight

Sharding is a routing problem, not a storage problem. If a layer in front of MySQL knew which shard owned which row, applications could keep speaking plain MySQL while the cluster grew underneath. The storage engine — InnoDB, with its proven replication and crash recovery — did not need to change at all.

The Result

Vitess shipped routing (VTGate), per-shard supervision (VTTablet), online resharding via binlog streaming (VReplication), and topology-driven failover. The same MySQL that ran on one box could now run on 10,000, with the operational primitives — backups, replicas, schema changes — surviving the transition.

VTGate: The Stateless Query Router

Every connection from your application lands on a VTGate. It is the brain of the routing layer.

VTGate speaks the MySQL wire protocol on the front and gRPC to VTTablets on the back. It is stateless, which means you run a fleet of them behind a load balancer and lose none of them matter — restart any VTGate and connections reconnect elsewhere. The state it needs (vschema, shard map, tablet health) lives in the topology server and is cached in memory with a watch-driven refresh.

A query enters VTGate as raw SQL bytes and traverses four stages. The parser builds a Vitess AST (a fork of the upstream MySQL parser, kept in go/vt/sqlparser). The analyzer resolves identifiers, walks the vschema, and tags expressions with their column origin. The planner uses Gen4 — Vitess's current planner — to decide which shards must run which sub-queries. The executor dispatches those sub-queries to VTTablets in parallel, then merges, sorts, aggregates, or limits the results before returning rows to the client.

Plan caching

Plans are expensive (parse + analyze + plan can take ms). VTGate caches them keyed on the normalized query string. Bind variables make this work — SELECT * FROM users WHERE id = ? produces one cache entry no matter how many user IDs you query.

VTTablet: The Per-Shard Agent

Every MySQL server in the cluster has a VTTablet sidecar. Together they form a tablet.

VTTablet is a single Go process that lives next to mysqld on the same host. It mediates every connection: VTGate never opens a raw MySQL connection to a backend, only to a VTTablet, which proxies it. This gives Vitess hooks for connection pooling, hot-row protection, query timeouts, transaction caps, query consolidation (deduplicating identical reads in flight), and per-table ACLs — features MySQL itself does not provide.

A VTTablet has a type that the topology server publishes: PRIMARY, REPLICA, RDONLY (replica that can be taken offline for batch jobs), BACKUP, RESTORE, or SPARE. Type transitions drive operational behavior. Reparenting (failover) is a coordinated dance across VTTablets and the topology server: one tablet stops being PRIMARY, another applies its remaining binlog events from the old primary, then claims PRIMARY in topo and starts accepting writes.

VTTablet also runs the schema engine (caches the live schema and detects drift), the tx engine (manages transactions and the reserved-connection pool for transactions that hold MySQL session state), and the messaging engine (Vitess's polled-table queue feature). Inside it sits VReplication, the binlog-streaming subsystem covered later.

The Topology Server

Vitess's only stateful component besides MySQL itself.

Vitess does not invent a consensus system. It plugs into etcd, ZooKeeper, or Consul and stores three kinds of data: the keyspace+shard map (which shards exist, what their key ranges are), the vschema (sharding rules per table), and tablet records (one per tablet — type, hostname, ports, alias). Leadership is derived: the tablet record with type=PRIMARY for a shard is the primary; failover is a topo write.

All Vitess components watch topo. VTGates watch the shard map and vschema so query routing reacts within milliseconds of an operator change. VTTablets watch their own record to learn type transitions and keyspace-level config. vtctld, the cluster management daemon, is the only writer for most operations — it serializes resharding, reparenting, and schema changes.

The topology load is light: a few KB per tablet, watched by stable processes. A 1,000-shard Vitess cluster can run on a 3-node etcd ring sized for 100 MB of state. The hot path (query routing, transactions) never touches topo — only the control plane does.

Keyspaces, Shards, and Vindexes

The data model that turns "MySQL" into "many MySQLs that look like one."

A keyspace is a logical database — what you would have called a database in plain MySQL. A keyspace can be unsharded (one shard, behaves like a normal MySQL database) or sharded (split into N shards by a key range). Each shard owns a contiguous range of the 8-byte keyspace ID space, written in hex: -80 means "from start up to 0x80…", 80-c0 means "0x80… up to 0xc0…", and so on.

A vindex maps column values to keyspace IDs. The primary vindex on each table is mandatory — it determines which shard a row lives on. Common primary vindexes: hash (8-byte hash of an integer column, uniform distribution), binary_md5 (MD5 of a string column), xxhash (faster non-crypto hash). Tables that should travel together (e.g. a user's orders and that user's profile) share the same primary vindex on their respective user_id columns, guaranteeing co-location.

Secondary vindexes let you route queries that filter on non-primary columns. A lookup vindex stores (column_value → keyspace_id) in a separate lookup table, often unsharded, often itself in Vitess. When VTGate sees WHERE email = ? on a table where email is a lookup vindex, it queries the lookup table first to learn which shard, then sends the actual query there — one network round-trip instead of a fanout to every shard.

A query that filters on no vindex column becomes a scatter — fan out to every shard, gather, merge. Scatters are correct but expensive: their tail latency is bounded by the slowest shard, and they consume connections everywhere. A well-designed vschema makes the 99% of your traffic single-shard or lookup-routed; only rare queries scatter.

Vindex types in production

hash: 8-byte FNV hash of an integer. binary_md5: for string columns. xxhash: non-cryptographic, ~5x faster than MD5. lookup_unique: 1:1 lookup table. lookup: 1:N lookup table. consistent_lookup: updated transactionally with the source row, no eventual-consistency window. numeric_static_map: static small-integer-to-shard mapping.

The Planning Workflow

How VTGate turns one SQL string into a tree of per-shard sub-queries.

Take SELECT name FROM users WHERE id = 42. The parser produces an AST. The analyzer notes that users is sharded with primary vindex hash(id). Gen4 planner observes the WHERE clause has an equality on the vindex column, computes hash(42) → keyspace_id, looks up which shard owns that ID, and emits a single-shard plan: send the query verbatim to that one VTTablet. This is the happy path — sub-millisecond planning, zero overhead beyond a hash and a map lookup.

Take SELECT name FROM users WHERE country = 'DE' ORDER BY created_at LIMIT 100. No primary vindex on country. Gen4 emits a scatter plan: send the query to every shard, each returning up to 100 rows sorted by created_at. VTGate runs a streaming N-way merge across the per-shard cursors and stops at 100 rows total. Crucially, each shard's sort pushes through to MySQL — VTGate never sorts more than necessary.

Take SELECT u.name, o.total FROM users u JOIN orders o ON u.id = o.user_id WHERE u.id = 42. Both tables share the same primary vindex on user_id. Gen4 recognizes co-location and sends the JOIN as a single SQL statement to the shard owning user 42. MySQL does the JOIN locally — no cross-shard work.

Take a JOIN that is cross-shard. Gen4 produces a join plan with a "build side" (often the smaller, lookup-routed side) executed first; the resulting key set is then pushed as an IN clause to the other side's shards. Vitess prefers this hash-style join over true cross-shard merges because it minimizes data movement and uses MySQL's existing optimizer for the per-shard work.

VReplication: The Heart of Online Operations

Almost every operational miracle in Vitess is, underneath, a VReplication workflow.

VReplication is a binlog-streaming subsystem inside VTTablet. A VReplication stream reads MySQL binlog events from a source tablet (using a regular MySQL replication connection), filters them (by table, by row predicate, by transformation), and applies them to a target tablet. It tracks GTID positions per stream so it can resume after restarts and detect lag.

MoveTables uses VReplication to move a set of tables from one keyspace to another. Reshard uses it to split or merge shards. Materialize uses it to build a derived table on a different sharding scheme — for example, a users_by_email table sharded on email so email lookups become single-shard. Online DDL uses it (combined with gh-ost or vitess-managed schema changes) to apply schema changes without locking large tables.

The lifecycle of every workflow is the same: Create (set up streams and target tables, copy phase backfills existing rows in chunks while the stream catches up on writes), SwitchTraffic (move reads first, then writes, with reverse replication enabled so old shards stay current), DropSources (delete the now-redundant streams and reclaim source-side data). Each step is a single vtctldclient command, idempotent, and rollback-able until the final drop.

Online Resharding via SwitchTraffic

The crown jewel: take one shard, split it into two, with no application-visible downtime.

You have a shard - (unsharded keyspace) running hot. You want -80 and 80-. The procedure:

Create the targets. Provision two new tablet sets for shards -80 and 80-, copying the source schema. They start empty.
Reshard --create. VReplication starts copy phase: reads source rows in primary-key chunks, writes them to the target shard determined by the primary vindex. While copy runs, every binlog event from the source is also streamed and held in queue.
Catch up. When copy finishes, the streams drain the queued binlog events and then track live writes within ~hundreds of ms.
VDiff. Optional but recommended: compare row counts and checksums between source and target shards while live to confirm correctness.
SwitchTraffic --tablet-types=rdonly,replica. VTGate begins routing read queries to the new shards. Writes still go to source.
SwitchTraffic --tablet-types=primary. Writes pause for a few seconds while in-flight transactions drain on source, the cutover GTID is recorded, and VTGate's routing rules update. Reverse replication starts: target → source, so the old shard stays warm.
DropSources. Once you trust the new layout, drop the source streams and shard. The reverse replication stops; rollback is no longer possible.

Total application-visible write pause: typically 2–10 seconds. Total operator wall-clock: hours to days for a large dataset, but the operator is just monitoring — no manual data movement.

Semi-Sync Replication and Failover

How Vitess keeps writes durable when a primary dies.

Vitess strongly recommends semi-sync replication on every shard. A primary cannot ack a commit to the client until at least one replica has written the binlog event to its relay log. This bounds the data-loss window: if the primary dies before the ack, the client did not get one; if the primary dies after, at least one replica has the data.

Failover is orchestrated by PlannedReparentShard (graceful, used for maintenance) or EmergencyReparentShard (the primary is gone). ERS picks the replica with the most up-to-date GTID position, applies any remaining events from surviving replicas via flashback, promotes it to PRIMARY in topo, and reparents the other replicas to follow it. VTGate's tablet health watcher notices the new PRIMARY within a second and routes new writes there. Open transactions on the old primary fail with a network error; the client retries.

For deployments that need automated failover without an operator pressing a button, VTOrc (Vitess Orchestrator) watches every shard and triggers ERS when its health checks declare a primary down. VTOrc is conservative — multi-region splits and topology partitions are common false-positive triggers, so VTOrc requires a quorum of replicas to agree before promoting.

Tradeoffs and When to Use Vitess

Vitess is not free. Be honest about the cost.

What you give up: cross-shard transactions are restricted (Vitess can do atomic commits across shards via 2PC, but the default best-effort mode is non-atomic). Foreign keys cannot span shards. AUTO_INCREMENT on a sharded table does not work; you use Vitess sequences instead. Some MySQL features (savepoints across savepoint resets, certain stored procedures, user-defined variables that persist across statements) need "reserved connections" which pin a client to one MySQL connection — slower and more memory-hungry than the pooled fast path.

What you take on: Vitess is a distributed system. You now run topo (etcd), VTGate, VTTablets, and possibly VTOrc — none of which existed in your single-MySQL world. Operator skill matters: you must understand vschema design, the difference between reparenting modes, what reverse replication does, and how to read vrepl_log.

Use Vitess when: your dataset cannot fit on the largest single MySQL machine, write throughput exceeds what one primary can absorb, or you need online resharding because your data layout will keep changing. Slack uses Vitess because the messaging fan-out doesn't fit one box. GitHub uses it for issues and PRs. PlanetScale productized it as a managed service.

Don't use Vitess when: your database fits in one MySQL with read replicas — the operational overhead is not worth it. If you need globally distributed transactions across regions, look at CockroachDB or Spanner; Vitess is single-region per keyspace by default. If you want a from-scratch distributed SQL engine, TiDB will feel cleaner.

Vitess vs PlanetScale vs TiDB

	Vitess	PlanetScale	TiDB
Backend storage	Real MySQL (InnoDB)	Real MySQL on Vitess (Metal: NVMe)	TiKV (Raft + RocksDB)
Sharding model	Manual vschema, online resharding	Same as Vitess + branching UX	Auto-sharded regions
Cross-shard txn	Best-effort default, optional 2PC	Same as Vitess	Native Percolator-style
Schema changes	VReplication-based online DDL	Branching + deploy requests	Online by default
Failover	Manual (PRS) or VTOrc	Managed	Automatic via PD + Raft
License	Apache 2.0 (CNCF graduated)	Proprietary SaaS	Apache 2.0
Best for	Self-hosted MySQL at scale	Teams that want zero ops	HTAP + auto-scale workloads

Frequently Asked Questions

Where did Vitess come from?

Vitess was built at YouTube starting in 2010 to keep MySQL alive as YouTube's metadata grew past what a single MySQL server (or even a small cluster) could handle. Sugu Sougoumarane and Mike Solomon led the project. It became open source in 2012, joined the CNCF in 2018, and graduated as a top-level CNCF project in 2019. Today it powers Slack, GitHub, HubSpot, JD.com, Etsy, Pinterest, and is the engine behind PlanetScale's managed offering.

How is Vitess different from running MySQL with read replicas?

Read replicas only solve read scaling. They cannot help write throughput, cannot grow disk capacity past what one primary holds, and force application code to choose which replica to query. Vitess adds horizontal sharding (split one logical keyspace across N MySQL servers), online resharding (move data between shards without downtime), schema migrations that run safely at scale, and a routing layer (VTGate) that hides the sharding from your app — your app still sees a single MySQL endpoint.

Do I need to rewrite my queries to work with Vitess?

Mostly no. VTGate parses MySQL syntax and routes queries automatically using the vschema. Single-shard queries (those filtering on the sharding key) just work. Cross-shard queries (scatter-gathers, JOINs that span shards) also work but pay a coordination cost. The queries you must rewrite are ones that depend on MySQL features Vitess restricts: cross-shard transactions, certain stored procedures, FK constraints across shards, and AUTO_INCREMENT (replaced by Vitess sequences).

How does Vitess perform an online resharding?

VReplication streams binlog events from the source shards to the target shards while the workflow runs. Once the targets are caught up within a few seconds of lag, an operator runs SwitchTraffic — first for reads, then for writes. Writes are briefly paused (seconds) while the binlog cutover lands and VTGate's routing rules update. Old shards remain available for rollback. Reverse replication keeps them in sync until you DropSources.

What is a vindex and why does it matter?

A vindex maps a column value to a keyspace ID, which determines the shard. Primary vindexes are unique and define how rows are sharded (e.g., hash(user_id)). Secondary vindexes let you route queries that filter on other columns by maintaining a lookup table. A scatter vindex means 'no vindex applies, query all shards.' Choosing primary vindexes well is the single most important design decision for a Vitess deployment — pick the wrong column and your most common queries become scatter-gathers.

How does Vitess compare to TiDB?

Both shard MySQL-compatible workloads, but the architecture is opposite. Vitess wraps real MySQL servers and adds a routing+management layer on top — you keep MySQL's storage engine, replication, and ecosystem. TiDB is a from-scratch distributed SQL engine on top of TiKV (a Raft-replicated KV store). TiDB gives you transparent global transactions and elastic scaling without manual resharding, but you give up MySQL's exact compatibility, its mature replication tooling, and the option to drop down to plain MySQL if needed.

Is PlanetScale just hosted Vitess?

Initially yes — PlanetScale was founded by Sugu and Jiten, the original Vitess creators, and shipped as managed Vitess. Over time PlanetScale built proprietary features on top (branching, deploy requests, Boost edge caching) and in 2024 announced PlanetScale Metal which replaced EBS with NVMe. PlanetScale still tracks Vitess upstream but the management plane and developer experience are theirs. If you want pure open-source Vitess, you run it yourself or use providers like Vitess.io's reference operator.