Vitess Internals
Vitess is the database layer that lets a single MySQL deployment scale to thousands of nodes without sacrificing the operational maturity of MySQL itself. Born at YouTube in 2010 and now a graduated CNCF project, it sits in front of unmodified MySQL servers as a routing, management, and replication tier — turning N MySQL primaries into one logical database. Slack, GitHub, HubSpot, and PlanetScale all run on it.
Based on the Vitess source, the official Vitess documentation, and Sugu Sougoumarane's published talks on the YouTube origin story.
Vitess Architecture
Key Numbers
Why Vitess Exists
VTGate: The Stateless Query Router
Every connection from your application lands on a VTGate. It is the brain of the routing layer.
VTGate speaks the MySQL wire protocol on the front and gRPC to VTTablets on the back. It is stateless, which means you run a fleet of them behind a load balancer and lose none of them matter — restart any VTGate and connections reconnect elsewhere. The state it needs (vschema, shard map, tablet health) lives in the topology server and is cached in memory with a watch-driven refresh.
A query enters VTGate as raw SQL bytes and traverses four stages. The parser
builds a Vitess AST (a fork of the upstream MySQL parser, kept in go/vt/sqlparser).
The analyzer resolves identifiers, walks the vschema, and tags expressions
with their column origin. The planner uses Gen4 — Vitess's current planner —
to decide which shards must run which sub-queries. The executor dispatches
those sub-queries to VTTablets in parallel, then merges, sorts, aggregates, or limits
the results before returning rows to the client.
SELECT * FROM users WHERE id = ?
produces one cache entry no matter how many user IDs you query.
VTTablet: The Per-Shard Agent
Every MySQL server in the cluster has a VTTablet sidecar. Together they form a tablet.
VTTablet is a single Go process that lives next to mysqld on the same host. It mediates every connection: VTGate never opens a raw MySQL connection to a backend, only to a VTTablet, which proxies it. This gives Vitess hooks for connection pooling, hot-row protection, query timeouts, transaction caps, query consolidation (deduplicating identical reads in flight), and per-table ACLs — features MySQL itself does not provide.
A VTTablet has a type that the topology server publishes: PRIMARY, REPLICA, RDONLY (replica that can be taken offline for batch jobs), BACKUP, RESTORE, or SPARE. Type transitions drive operational behavior. Reparenting (failover) is a coordinated dance across VTTablets and the topology server: one tablet stops being PRIMARY, another applies its remaining binlog events from the old primary, then claims PRIMARY in topo and starts accepting writes.
VTTablet also runs the schema engine (caches the live schema and detects drift), the tx engine (manages transactions and the reserved-connection pool for transactions that hold MySQL session state), and the messaging engine (Vitess's polled-table queue feature). Inside it sits VReplication, the binlog-streaming subsystem covered later.
The Topology Server
Vitess's only stateful component besides MySQL itself.
Vitess does not invent a consensus system. It plugs into etcd, ZooKeeper, or Consul and stores three kinds of data: the keyspace+shard map (which shards exist, what their key ranges are), the vschema (sharding rules per table), and tablet records (one per tablet — type, hostname, ports, alias). Leadership is derived: the tablet record with type=PRIMARY for a shard is the primary; failover is a topo write.
All Vitess components watch topo. VTGates watch the shard map and vschema so query
routing reacts within milliseconds of an operator change. VTTablets watch their own record
to learn type transitions and keyspace-level config. vtctld, the cluster
management daemon, is the only writer for most operations — it serializes resharding,
reparenting, and schema changes.
The topology load is light: a few KB per tablet, watched by stable processes. A 1,000-shard Vitess cluster can run on a 3-node etcd ring sized for 100 MB of state. The hot path (query routing, transactions) never touches topo — only the control plane does.
Keyspaces, Shards, and Vindexes
The data model that turns "MySQL" into "many MySQLs that look like one."
A keyspace is a logical database — what you would have called a database
in plain MySQL. A keyspace can be unsharded (one shard, behaves like a normal MySQL
database) or sharded (split into N shards by a key range). Each shard owns a contiguous
range of the 8-byte keyspace ID space, written in hex: -80 means "from start
up to 0x80…", 80-c0 means "0x80… up to 0xc0…", and so on.
A vindex maps column values to keyspace IDs. The primary vindex
on each table is mandatory — it determines which shard a row lives on. Common primary
vindexes: hash (8-byte hash of an integer column, uniform distribution),
binary_md5 (MD5 of a string column), xxhash (faster non-crypto
hash). Tables that should travel together (e.g. a user's orders and that user's profile)
share the same primary vindex on their respective user_id columns, guaranteeing co-location.
Secondary vindexes let you route queries that filter on non-primary
columns. A lookup vindex stores (column_value → keyspace_id) in a separate
lookup table, often unsharded, often itself in Vitess. When VTGate sees
WHERE email = ? on a table where email is a lookup vindex, it queries the
lookup table first to learn which shard, then sends the actual query there — one network
round-trip instead of a fanout to every shard.
A query that filters on no vindex column becomes a scatter — fan out to every shard, gather, merge. Scatters are correct but expensive: their tail latency is bounded by the slowest shard, and they consume connections everywhere. A well-designed vschema makes the 99% of your traffic single-shard or lookup-routed; only rare queries scatter.
The Planning Workflow
How VTGate turns one SQL string into a tree of per-shard sub-queries.
Take SELECT name FROM users WHERE id = 42. The parser produces an AST.
The analyzer notes that users is sharded with primary vindex hash(id).
Gen4 planner observes the WHERE clause has an equality on the vindex column,
computes hash(42) → keyspace_id, looks up which shard owns that ID, and emits
a single-shard plan: send the query verbatim to that one VTTablet. This is the happy path
— sub-millisecond planning, zero overhead beyond a hash and a map lookup.
Take SELECT name FROM users WHERE country = 'DE' ORDER BY created_at LIMIT 100.
No primary vindex on country. Gen4 emits a scatter plan: send the query to every shard,
each returning up to 100 rows sorted by created_at. VTGate runs a streaming N-way merge
across the per-shard cursors and stops at 100 rows total. Crucially, each shard's sort
pushes through to MySQL — VTGate never sorts more than necessary.
Take SELECT u.name, o.total FROM users u JOIN orders o ON u.id = o.user_id WHERE u.id = 42.
Both tables share the same primary vindex on user_id. Gen4 recognizes co-location and
sends the JOIN as a single SQL statement to the shard owning user 42. MySQL does the
JOIN locally — no cross-shard work.
Take a JOIN that is cross-shard. Gen4 produces a join plan with a "build side" (often the smaller, lookup-routed side) executed first; the resulting key set is then pushed as an IN clause to the other side's shards. Vitess prefers this hash-style join over true cross-shard merges because it minimizes data movement and uses MySQL's existing optimizer for the per-shard work.
VReplication: The Heart of Online Operations
Almost every operational miracle in Vitess is, underneath, a VReplication workflow.
VReplication is a binlog-streaming subsystem inside VTTablet. A VReplication stream reads MySQL binlog events from a source tablet (using a regular MySQL replication connection), filters them (by table, by row predicate, by transformation), and applies them to a target tablet. It tracks GTID positions per stream so it can resume after restarts and detect lag.
MoveTables uses VReplication to move a set of tables from one keyspace
to another. Reshard uses it to split or merge shards. Materialize
uses it to build a derived table on a different sharding scheme — for example, a
users_by_email table sharded on email so email lookups become single-shard.
Online DDL uses it (combined with gh-ost or vitess-managed schema changes)
to apply schema changes without locking large tables.
The lifecycle of every workflow is the same: Create (set up streams and
target tables, copy phase backfills existing rows in chunks while the stream catches up
on writes), SwitchTraffic (move reads first, then writes, with reverse
replication enabled so old shards stay current), DropSources (delete
the now-redundant streams and reclaim source-side data). Each step is a single
vtctldclient command, idempotent, and rollback-able until the final drop.
Online Resharding via SwitchTraffic
The crown jewel: take one shard, split it into two, with no application-visible downtime.
You have a shard - (unsharded keyspace) running hot. You want
-80 and 80-. The procedure:
- Create the targets. Provision two new tablet sets for shards
-80and80-, copying the source schema. They start empty. - Reshard --create. VReplication starts copy phase: reads source rows in primary-key chunks, writes them to the target shard determined by the primary vindex. While copy runs, every binlog event from the source is also streamed and held in queue.
- Catch up. When copy finishes, the streams drain the queued binlog events and then track live writes within ~hundreds of ms.
- VDiff. Optional but recommended: compare row counts and checksums between source and target shards while live to confirm correctness.
- SwitchTraffic --tablet-types=rdonly,replica. VTGate begins routing read queries to the new shards. Writes still go to source.
- SwitchTraffic --tablet-types=primary. Writes pause for a few seconds while in-flight transactions drain on source, the cutover GTID is recorded, and VTGate's routing rules update. Reverse replication starts: target → source, so the old shard stays warm.
- DropSources. Once you trust the new layout, drop the source streams and shard. The reverse replication stops; rollback is no longer possible.
Total application-visible write pause: typically 2–10 seconds. Total operator wall-clock: hours to days for a large dataset, but the operator is just monitoring — no manual data movement.
Semi-Sync Replication and Failover
How Vitess keeps writes durable when a primary dies.
Vitess strongly recommends semi-sync replication on every shard. A primary cannot ack a commit to the client until at least one replica has written the binlog event to its relay log. This bounds the data-loss window: if the primary dies before the ack, the client did not get one; if the primary dies after, at least one replica has the data.
Failover is orchestrated by PlannedReparentShard (graceful, used for
maintenance) or EmergencyReparentShard (the primary is gone). ERS picks the
replica with the most up-to-date GTID position, applies any remaining events from
surviving replicas via flashback, promotes it to PRIMARY in topo, and reparents the
other replicas to follow it. VTGate's tablet health watcher notices the new PRIMARY
within a second and routes new writes there. Open transactions on the old primary fail
with a network error; the client retries.
For deployments that need automated failover without an operator pressing a button, VTOrc (Vitess Orchestrator) watches every shard and triggers ERS when its health checks declare a primary down. VTOrc is conservative — multi-region splits and topology partitions are common false-positive triggers, so VTOrc requires a quorum of replicas to agree before promoting.
Tradeoffs and When to Use Vitess
Vitess is not free. Be honest about the cost.
What you give up: cross-shard transactions are restricted (Vitess can do atomic commits across shards via 2PC, but the default best-effort mode is non-atomic). Foreign keys cannot span shards. AUTO_INCREMENT on a sharded table does not work; you use Vitess sequences instead. Some MySQL features (savepoints across savepoint resets, certain stored procedures, user-defined variables that persist across statements) need "reserved connections" which pin a client to one MySQL connection — slower and more memory-hungry than the pooled fast path.
What you take on: Vitess is a distributed system. You now run topo (etcd), VTGate, VTTablets, and possibly VTOrc — none of which existed in your single-MySQL world. Operator skill matters: you must understand vschema design, the difference between reparenting modes, what reverse replication does, and how to read vrepl_log.
Use Vitess when: your dataset cannot fit on the largest single MySQL machine, write throughput exceeds what one primary can absorb, or you need online resharding because your data layout will keep changing. Slack uses Vitess because the messaging fan-out doesn't fit one box. GitHub uses it for issues and PRs. PlanetScale productized it as a managed service.
Don't use Vitess when: your database fits in one MySQL with read replicas — the operational overhead is not worth it. If you need globally distributed transactions across regions, look at CockroachDB or Spanner; Vitess is single-region per keyspace by default. If you want a from-scratch distributed SQL engine, TiDB will feel cleaner.
Vitess vs PlanetScale vs TiDB
| Vitess | PlanetScale | TiDB | |
|---|---|---|---|
| Backend storage | Real MySQL (InnoDB) | Real MySQL on Vitess (Metal: NVMe) | TiKV (Raft + RocksDB) |
| Sharding model | Manual vschema, online resharding | Same as Vitess + branching UX | Auto-sharded regions |
| Cross-shard txn | Best-effort default, optional 2PC | Same as Vitess | Native Percolator-style |
| Schema changes | VReplication-based online DDL | Branching + deploy requests | Online by default |
| Failover | Manual (PRS) or VTOrc | Managed | Automatic via PD + Raft |
| License | Apache 2.0 (CNCF graduated) | Proprietary SaaS | Apache 2.0 |
| Best for | Self-hosted MySQL at scale | Teams that want zero ops | HTAP + auto-scale workloads |
Frequently Asked Questions
Where did Vitess come from?
Vitess was built at YouTube starting in 2010 to keep MySQL alive as YouTube's metadata grew past what a single MySQL server (or even a small cluster) could handle. Sugu Sougoumarane and Mike Solomon led the project. It became open source in 2012, joined the CNCF in 2018, and graduated as a top-level CNCF project in 2019. Today it powers Slack, GitHub, HubSpot, JD.com, Etsy, Pinterest, and is the engine behind PlanetScale's managed offering.
How is Vitess different from running MySQL with read replicas?
Read replicas only solve read scaling. They cannot help write throughput, cannot grow disk capacity past what one primary holds, and force application code to choose which replica to query. Vitess adds horizontal sharding (split one logical keyspace across N MySQL servers), online resharding (move data between shards without downtime), schema migrations that run safely at scale, and a routing layer (VTGate) that hides the sharding from your app — your app still sees a single MySQL endpoint.
Do I need to rewrite my queries to work with Vitess?
Mostly no. VTGate parses MySQL syntax and routes queries automatically using the vschema. Single-shard queries (those filtering on the sharding key) just work. Cross-shard queries (scatter-gathers, JOINs that span shards) also work but pay a coordination cost. The queries you must rewrite are ones that depend on MySQL features Vitess restricts: cross-shard transactions, certain stored procedures, FK constraints across shards, and AUTO_INCREMENT (replaced by Vitess sequences).
How does Vitess perform an online resharding?
VReplication streams binlog events from the source shards to the target shards while the workflow runs. Once the targets are caught up within a few seconds of lag, an operator runs SwitchTraffic — first for reads, then for writes. Writes are briefly paused (seconds) while the binlog cutover lands and VTGate's routing rules update. Old shards remain available for rollback. Reverse replication keeps them in sync until you DropSources.
What is a vindex and why does it matter?
A vindex maps a column value to a keyspace ID, which determines the shard. Primary vindexes are unique and define how rows are sharded (e.g., hash(user_id)). Secondary vindexes let you route queries that filter on other columns by maintaining a lookup table. A scatter vindex means 'no vindex applies, query all shards.' Choosing primary vindexes well is the single most important design decision for a Vitess deployment — pick the wrong column and your most common queries become scatter-gathers.
How does Vitess compare to TiDB?
Both shard MySQL-compatible workloads, but the architecture is opposite. Vitess wraps real MySQL servers and adds a routing+management layer on top — you keep MySQL's storage engine, replication, and ecosystem. TiDB is a from-scratch distributed SQL engine on top of TiKV (a Raft-replicated KV store). TiDB gives you transparent global transactions and elastic scaling without manual resharding, but you give up MySQL's exact compatibility, its mature replication tooling, and the option to drop down to plain MySQL if needed.
Is PlanetScale just hosted Vitess?
Initially yes — PlanetScale was founded by Sugu and Jiten, the original Vitess creators, and shipped as managed Vitess. Over time PlanetScale built proprietary features on top (branching, deploy requests, Boost edge caching) and in 2024 announced PlanetScale Metal which replaced EBS with NVMe. PlanetScale still tracks Vitess upstream but the management plane and developer experience are theirs. If you want pure open-source Vitess, you run it yourself or use providers like Vitess.io's reference operator.