MySQL Internals
MySQL is the world's most-deployed relational database, and almost every deployment runs on InnoDB — a sophisticated B+tree storage engine with clustered indexes, undo-based MVCC, a redo log that serves as its write-ahead log, and a doublewrite buffer for torn-write protection. Around InnoDB sits the MySQL server proper: the SQL parser, the optimizer, the binary log that drives replication, and the layer that exposes pluggable storage engines (legacy MyISAM, MEMORY, and a handful of less-used backends, all dwarfed by InnoDB). This page is the entry point — a hub linking to the deeper internals topics, with a substantive walkthrough of each major component and contrasts with PostgreSQL where the two differ.
- InnoDB Architecture
- Key Numbers
- Why InnoDB Won
Clustered Index: The Primary B+Tree
Every InnoDB table is its primary key.
In InnoDB, a table is stored as a b+tree whose leaves contain the actual row data, keyed
by primary-key value. There is no separate "heap" for rows — the b+tree IS the heap.
Tables without an explicit primary key use the first UNIQUE NOT NULL index as the
clustering key if one exists; otherwise InnoDB generates a hidden 6-byte DB_ROW_ID,
auto-incremented per row.
CREATE TABLE users (
  id    INT PRIMARY KEY,
  email VARCHAR(255) UNIQUE,
  name  VARCHAR(100)
);

primary clustered btree (the table itself):
  key   = id (INT)
  value = (id, email, name)    ← whole row in the leaf

secondary index on email:
  key   = email
  value = id                   ← PK as the row pointer
Reading by email: descend the secondary btree to find the PK, then descend the primary btree to find the row — two trees touched per query. With a covering index (every column the query needs is in the index), only the secondary tree is touched.
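A hedged illustration of the difference, using the users table above (values are made up):

SELECT id   FROM users WHERE email = 'a@x.com';   -- covering: the secondary index holds (email, id)
SELECT name FROM users WHERE email = 'a@x.com';   -- not covering: secondary lookup, then primary-tree lookup by id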
Implication: PK choice matters. A monotonic INT PK keeps inserts at the rightmost btree leaf — sequential I/O, no page splits. A random UUID PK inserts everywhere — random I/O, frequent page splits, and pages left roughly half full (up to double the disk usage). Use BIGINT AUTO_INCREMENT or sequential time-based UUIDs (UUIDv7) for high-write tables, as sketched below.
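A minimal sketch of the insert-friendly patterns; table and column names are illustrative:

-- sequential integer PK: every insert appends at the rightmost leaf
CREATE TABLE events (
  id      BIGINT AUTO_INCREMENT PRIMARY KEY,
  payload JSON
);

-- if UUIDs are required, use time-ordered ones and pack them to 16 bytes
CREATE TABLE sessions (
  id    BINARY(16) PRIMARY KEY,   -- e.g. UUIDv7, or UUID_TO_BIN(uuid, 1) for v1
  token VARCHAR(255)
);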
MVCC via Undo Log
Old row versions live in undo, not the heap.
Every row in InnoDB has two hidden columns: DB_TRX_ID (6 bytes, the txn that
last modified this row) and DB_ROLL_PTR (7 bytes, a pointer into the undo
log). When a row is updated, the old version is written to undo with its previous TRX_ID
and ROLL_PTR; the row in the data page is overwritten with the new version's TRX_ID and a
ROLL_PTR pointing to the old undo entry.
SELECT under READ VIEW (snapshot):
read row from data page → check its DB_TRX_ID
if DB_TRX_ID is visible to my snapshot → use this version
else → follow DB_ROLL_PTR into undo log
retrieve previous version, check its TRX_ID
repeat until visible (or end of chain)

Compare to PostgreSQL: PG keeps multiple versions in the heap (each row has xmin/xmax) and VACUUM removes obsolete ones. InnoDB keeps only the latest in the heap; older versions live in undo. PG's vacuum lag = bloat in the heap; InnoDB's undo retention = bloat in undo. Both have the long-running-transaction problem (old undo / old versions can't be cleaned).
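A hedged two-session sketch of the snapshot mechanics (session labels and values are illustrative):

-- session A
START TRANSACTION WITH CONSISTENT SNAPSHOT;
SELECT name FROM users WHERE id = 42;    -- reads the current version

-- session B
UPDATE users SET name = 'renamed' WHERE id = 42;
COMMIT;                                  -- the old version now lives only in undo

-- session A, same transaction
SELECT name FROM users WHERE id = 42;
-- still the old value: InnoDB follows DB_ROLL_PTR into undo to rebuild it
COMMIT;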
Redo Log (the WAL)
A fixed-capacity circular log; group commit makes commits cheap.
InnoDB's redo log functions as the WAL. Two files (ib_logfile0,
ib_logfile1) through 8.0.29; from 8.0.30, a set of 32 #ib_redo files whose total size is
innodb_redo_log_capacity (default 100 MB; write-heavy systems raise it to several GB).
Logs are organized as mini-transactions (mtr) — atomic groups of low-level
page changes (e.g. one row insert plus the corresponding secondary index updates).
backend writes mtr log records to redo buffer (RAM)
    ↓
on commit: backend appends commit marker, flushes redo buffer to log file
    ↓
if innodb_flush_log_at_trx_commit=1: fsync now; otherwise: defer
    ↓
return success to client
Group commit batches multiple commits' fsyncs together. Concurrent commits arriving within the same window all share one fsync. Combined with binlog group commit, MySQL can serve tens of thousands of commits per second on a single host with full durability.
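The durability/latency trade-off hangs on two variables. A hedged sketch of the two common configurations (both variables are dynamic):

-- full durability: redo and binlog fsynced at every commit
-- (group commit amortizes these fsyncs across concurrent transactions)
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
SET GLOBAL sync_binlog = 1;

-- relaxed: redo written to the OS page cache at commit, fsynced ~once/second;
-- an OS crash or power loss can drop up to ~1s of commits
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
SET GLOBAL sync_binlog = 0;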
Doublewrite Buffer
Defense against torn writes.
A 16 KB page write to disk is not atomic on most filesystems — power loss in the middle
can leave the page half-written, corrupting it. InnoDB's defense: write each modified page
first to the doublewrite buffer (a 2 MB sequential area in the system tablespace),
fsync, then write to the page's actual location, fsync. On
crash recovery, pages with bad checksums in the data file are restored from the
doublewrite buffer.
Cost: every page write happens twice. On a write-heavy workload, this is significant. Modern alternatives:
- innodb_doublewrite=OFF: skip doublewrite. Safe ONLY if the storage guarantees atomic 16 KB writes (some enterprise NVMe with O_DIRECT and matching sector size; ZFS with recordsize=16K).
- atomic_writes (Percona/MariaDB extension): tell the kernel to use filesystem-level atomic writes. Same effect, no doublewrite.
For most cloud disks, leave doublewrite on — the safety is worth the throughput cost.
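To check the setting and measure the overhead on a running server (counter names as exposed by stock MySQL):

-- is doublewrite active?
SHOW GLOBAL VARIABLES LIKE 'innodb_doublewrite';
-- how much extra writing is it doing?
SHOW GLOBAL STATUS LIKE 'Innodb_dblwr%';   -- pages written, write batches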
Change Buffer
Deferring secondary index updates when pages aren't in memory.
Updating a non-unique secondary index when its page isn't in the buffer pool would require reading the page (random I/O), modifying it, writing it back. The change buffer instead stages the change in a special area of the buffer pool: a record describing what to do. When the page is later read for any reason (a query, a checkpoint flush), the change buffer entries are merged into it.
UPDATE users SET email = 'new@x' WHERE id = 42;
→ primary btree page (in buffer pool) → modify directly
→ secondary index on email (page NOT in buffer pool):
INSERT change buffer record for "delete old@x, insert new@x"
    continue without I/O

The win: write-heavy workloads with large secondary indexes that don't fit in RAM avoid tons of random reads. The cost: change buffer records consume buffer pool space; reads that have to wait for a change buffer merge are slower; and it doesn't apply to unique indexes (which need an immediate uniqueness check).
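Change buffering can be scoped or switched off at runtime. A hedged sketch; the accepted values for innodb_change_buffering are none, inserts, deletes, changes, purges, and all:

-- buffer only insert operations, or disable change buffering entirely
SET GLOBAL innodb_change_buffering = 'inserts';
SET GLOBAL innodb_change_buffering = 'none';

-- merge activity appears under "INSERT BUFFER AND ADAPTIVE HASH INDEX"
SHOW ENGINE INNODB STATUS\G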
Adaptive Hash Index
An automatic in-memory hash for hot keys.
InnoDB monitors which b+tree paths are being walked frequently. For sufficiently hot keys, it builds an in-memory hash index that maps key → page-and-record-offset, skipping the b+tree descent. Lookups become O(1) for the hot set.
Tunable: innodb_adaptive_hash_index=ON (default). Disable it on workloads with
uniformly random access (no hot keys to cache). Historically the hash was protected by a
single latch, which could become a contention bottleneck on many cores; modern versions
partition it (innodb_adaptive_hash_index_parts, default 8).
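The on/off switch is dynamic; the partition count is startup-only. A quick sketch:

-- disable the AHI if monitoring shows contention on the btr_search latches
SET GLOBAL innodb_adaptive_hash_index = OFF;
-- partition count is read-only at runtime (set it in my.cnf)
SELECT @@innodb_adaptive_hash_index_parts;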
Locking, Gap Locks, and REPEATABLE READ
Why MySQL's default isolation feels stricter than PostgreSQL's.
MySQL defaults to REPEATABLE READ; PostgreSQL defaults to READ COMMITTED. The visible difference: in REPEATABLE READ, a transaction's reads return the same data on repeat execution within the txn. In READ COMMITTED, every statement sees a fresh snapshot.
InnoDB's REPEATABLE READ adds gap locks on range queries with
FOR UPDATE or FOR SHARE:
-- Suppose users has rows with id 10, 20, 30
SELECT * FROM users WHERE id BETWEEN 15 AND 25 FOR UPDATE;
-- locks the gaps (10,20) and (20,30), plus the record at id=20
-- another txn cannot INSERT a row with id 16 while this txn holds locks
-- prevents phantom reads on re-execution of this query
Cost: gap locks contend more than record locks. Workloads with high INSERT rates against
hot ranges deadlock more often. Many MySQL DBAs run READ COMMITTED (which requires the
ROW binlog format) for higher throughput, at the cost of phantom risk in app code.
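Switching levels is straightforward; a hedged sketch of the common pairing (READ COMMITTED with row-based binlogging — statement-based binlogging requires at least REPEATABLE READ):

SET SESSION transaction_isolation = 'READ-COMMITTED';  -- named tx_isolation before 8.0
SET GLOBAL  binlog_format = 'ROW';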
Binlog and Replication
A separate log driving everything beyond the local server.
The binary log (binlog) is a server-level log of every change, separate from InnoDB's redo log. Three formats:
STATEMENT — logs SQL statements; replicas re-execute them. Compact but unsafe for non-deterministic SQL (NOW(), UUID()).
ROW — logs row-level changes (before/after images). Larger but deterministic; the modern default.
MIXED — STATEMENT for safe queries, ROW for unsafe. Compromise; rarely used in production.

Replication: a replica connects, reads the master's binlog, applies the events. Modes:
async — master commits independently of replicas (default)
semi-sync — master waits for at least one replica to ACK before commit
sync (group replication) — master waits for a quorum of replicas before commit
GTID-based — Global Transaction IDs (server-uuid:seq) instead of
file/offset; cleaner failover
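Enabling GTID mode on a live server must step through the permissive modes in order; a hedged sketch of the usual sequence:

SET GLOBAL enforce_gtid_consistency = ON;
SET GLOBAL gtid_mode = OFF_PERMISSIVE;
SET GLOBAL gtid_mode = ON_PERMISSIVE;
-- wait for anonymous (non-GTID) transactions to drain, then:
SET GLOBAL gtid_mode = ON;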
Parallel replication (since 5.7): replicas apply non-conflicting transactions in
parallel, eliminating the single-threaded SQL-apply bottleneck of older MySQL.
slave_parallel_workers sets the worker count;
slave_parallel_type=LOGICAL_CLOCK picks the modern dependency-aware scheduler
(renamed replica_parallel_workers and replica_parallel_type in 8.0.26).
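A hedged configuration sketch using the 8.0.26+ variable names (older servers use the slave_* spellings and STOP SLAVE syntax):

STOP REPLICA SQL_THREAD;
SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL replica_parallel_workers = 8;         -- size to replica CPU headroom
SET GLOBAL replica_preserve_commit_order = ON;   -- commit in source order
START REPLICA SQL_THREAD;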
MyISAM (Vestigial)
The original MySQL storage engine; rarely used now.
MyISAM was the default storage engine until 5.5 (2010). Three files per table:
.MYD (data), .MYI (index), .frm (schema). Fast for
read-only workloads (no MVCC overhead, simple b-tree indexes). No transactions, no row
locking (only table-level), no crash safety.
Reasons it's still around:
- Some legacy databases never converted (running fine, why touch?)
- system tables in the mysql schema were MyISAM until 8.0 (now InnoDB)
- FULLTEXT search was MyISAM-only until InnoDB got it in 5.6
- some niche tooling assumed the MyISAM file layout

For new tables in 2024+: never use MyISAM. Use InnoDB.
MySQL vs PostgreSQL
| Aspect | MySQL/InnoDB | PostgreSQL |
|---|---|---|
| Default engine | InnoDB | (only one — heap) |
| Index structure | clustered (PK = table) | heap + indexes point to TID |
| MVCC storage | undo log | multi-version heap + VACUUM |
| Default isolation | REPEATABLE READ | READ COMMITTED |
| Range lock model | gap + next-key locks | predicate locks (Serializable only) |
| WAL | redo log + binlog | single WAL |
| Replication | binlog (statement/row) | WAL streaming + logical decoding |
| Logical CDC | binlog with row format | logical decoding via slots |
| Read replicas | async or semi-sync | async, sync via synchronous_standby_names |
| Window functions | since 8.0 | since 8.4 |
| JSON support | JSON type, generated cols | JSONB with GIN index |
| License | GPL (Oracle) | PostgreSQL (BSD-like) |
Tuning Cheat Sheet
| Variable | Recommended | Note |
|---|---|---|
| innodb_buffer_pool_size | 70-80% RAM | most important knob |
| innodb_flush_log_at_trx_commit | 1 (safe) or 2 | 2 loses ≤1s on OS crash, much faster |
| sync_binlog | 1 | fsync binlog per commit (with group commit it's cheap) |
| innodb_redo_log_capacity | ~5-25 GB | bigger = fewer checkpoint stalls |
| innodb_io_capacity | match disk IOPS | SSD: 2000-20000 |
| innodb_io_capacity_max | 2× io_capacity | burst limit during catch-up |
| innodb_flush_method | O_DIRECT | avoid double-buffering with OS |
| innodb_doublewrite | ON (default) | OFF only with verified atomic writes |
| innodb_file_per_table | ON (default 5.6+) | per-table .ibd files |
| binlog_format | ROW | safe, deterministic |
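Several of these knobs are dynamic and can be applied without a restart; a hedged sketch (sizes are illustrative — innodb_flush_method is static and belongs in my.cnf):

-- buffer pool resizes online, in chunks, since 5.7
SET GLOBAL innodb_buffer_pool_size = 8 * 1024 * 1024 * 1024;
-- redo capacity is dynamic from 8.0.30
SET GLOBAL innodb_redo_log_capacity = 8 * 1024 * 1024 * 1024;
SET GLOBAL innodb_io_capacity = 4000;
SET GLOBAL innodb_io_capacity_max = 8000;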
FAQ
Why does InnoDB store rows in primary-key order?
InnoDB uses a clustered index design: the primary key IS the table's b-tree, with row data living in the leaves. Looking up a row by primary key is one b-tree descent — no extra fetch. The cost: secondary indexes can't point to disk addresses (those would change on row movement); they store the primary key as the row pointer, requiring a second lookup into the primary tree on every secondary-index hit. PostgreSQL takes the opposite approach (heap files + indexes that point to physical row locations) — fewer lookups via secondary indexes, but no automatic ordering.
What is the change buffer for?
Defers updates to non-unique secondary indexes when their pages aren't in the buffer pool. Instead of reading the page, modifying it, writing it back — InnoDB writes a 'change buffer record' describing the modification. When the page is later read for any reason, the change buffer entries are merged in. For write-heavy workloads with large secondary indexes that don't fit in RAM, this dramatically reduces random I/O. Disabled for unique indexes (they need to validate uniqueness immediately). Visible in SHOW ENGINE INNODB STATUS as 'INSERT BUFFER AND ADAPTIVE HASH INDEX'.
What is the doublewrite buffer?
A 2 MB sequential file (or area in the system tablespace) where InnoDB writes pages BEFORE writing them to their final location in the data files. Why: the OS may write only part of a 16 KB page to disk on a power loss (a 'torn write'), corrupting the page. The doublewrite gives InnoDB a known-good copy to recover from on the next startup. Performance cost: every page write happens twice. Modern SSDs with atomic 16 KB writes (e.g. via O_DIRECT and matching device sector size) make doublewrite redundant; innodb_doublewrite=OFF is safe on such hardware.
Why is REPEATABLE READ the default isolation level?
Historical and pragmatic. InnoDB's REPEATABLE READ is stronger than the SQL-standard level: gap and next-key locks prevent phantoms on locking reads, though it is not fully serializable (write skew is still possible). PostgreSQL defaults to READ COMMITTED, which is faster but allows phantom reads. MySQL chose stronger default safety; the cost is more locking on range queries. For high-throughput OLTP, many MySQL installs run with READ COMMITTED instead, accepting some phantoms in exchange for less lock contention.
What's a gap lock?
A lock on the range BETWEEN two existing index keys, not on a specific key. SELECT ... FOR UPDATE WHERE x BETWEEN 10 AND 20 takes gap locks at every gap between matching rows, plus next-key locks (gap + record) at the boundary rows. This prevents another transaction from inserting new rows in the range — preserving the result set if you re-execute. Gap locks are why MySQL's REPEATABLE READ avoids phantoms but can deadlock more aggressively. Disabled in READ COMMITTED isolation.
What is group commit in MySQL?
Multiple concurrent COMMITs are batched into a single fsync of the binlog (and a single fsync of the redo log). Without group commit, 1000 concurrent commits do 1000 fsyncs; with it, roughly one. MySQL's binlog group commit landed in 5.6, building on group-commit work pioneered in MariaDB. innodb_flush_log_at_trx_commit and sync_binlog control durability — both at 1 means a full sync per commit (safe, slow); 0 or 2 trade durability for speed.
How does InnoDB do MVCC?
Every InnoDB row carries a 6-byte DB_TRX_ID (last txn that modified it) and 7-byte DB_ROLL_PTR (pointer into the undo log). When a row is updated, the old version is written to the undo log; the row in the data page is overwritten with the new version. Readers at older snapshots follow the DB_ROLL_PTR backward through undo entries to reconstruct the version they should see. Compare to PostgreSQL: PG keeps multiple row versions in the heap and VACUUM cleans them up. MySQL keeps only the latest in the heap; older versions are reconstructed on demand from undo.