MySQL Internals
MySQL is the world's most-deployed relational database, and almost every deployment runs on InnoDB — a sophisticated B+tree storage engine with clustered indexes, undo-based MVCC, a redo log that serves as its write-ahead log, and a doublewrite buffer for torn-write protection. Around InnoDB sits the MySQL server proper: the SQL parser, the optimizer, the binary log that drives replication, and the layer that exposes pluggable storage engines (legacy MyISAM, MEMORY, and a handful of less-used backends, all dwarfed by InnoDB). This page is the entry point — a hub linking to the deeper internals topics, with a substantive walkthrough of each major component and contrasts with PostgreSQL where the two differ.
- InnoDB Architecture
- Key Numbers
- Why InnoDB Won
Clustered Index: The Primary B+Tree
Every InnoDB table is its primary key.
In InnoDB, a table is stored as a b+tree whose leaves contain the actual row data, keyed
by primary-key value. There is no separate "heap" for rows — the b+tree IS the heap.
Tables without an explicit primary key use the first UNIQUE NOT NULL index as the
clustering key if one exists; otherwise InnoDB generates a hidden 6-byte DB_ROW_ID,
auto-incremented per row.
CREATE TABLE users (
  id    INT PRIMARY KEY,
  email VARCHAR(255) UNIQUE,
  name  VARCHAR(100)
);

primary clustered btree (the table itself):
  key   = id (INT)
  value = (id, email, name)    ← whole row in the leaf

secondary index on email:
  key   = email
  value = id                   ← PK as the row pointer
Reading by email: descend the secondary btree to find the PK, then descend the primary btree to find the row — two trees touched per query. With a covering index (every column the query needs is in the index), only the secondary tree is touched.
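A hedged illustration of the difference, using the users table above (values are made up):

SELECT id   FROM users WHERE email = 'a@x.com';   -- covering: the secondary index holds (email, id)
SELECT name FROM users WHERE email = 'a@x.com';   -- not covering: secondary lookup, then primary-tree lookup by id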
Implication: PK choice matters. A monotonic INT PK keeps inserts at the rightmost btree leaf — sequential I/O, no page splits. A random UUID PK inserts everywhere — random I/O, frequent page splits, and pages left roughly half full (up to double the disk usage). Use BIGINT AUTO_INCREMENT or sequential time-based UUIDs (UUIDv7) for high-write tables, as sketched below.
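A minimal sketch of the insert-friendly patterns; table and column names are illustrative:

-- sequential integer PK: every insert appends at the rightmost leaf
CREATE TABLE events (
  id      BIGINT AUTO_INCREMENT PRIMARY KEY,
  payload JSON
);

-- if UUIDs are required, use time-ordered ones and pack them to 16 bytes
CREATE TABLE sessions (
  id    BINARY(16) PRIMARY KEY,   -- e.g. UUIDv7, or UUID_TO_BIN(uuid, 1) for v1
  token VARCHAR(255)
);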
MVCC via Undo Log
Old row versions live in undo, not the heap.
Every row in InnoDB has two hidden columns: DB_TRX_ID (6 bytes, the txn that
last modified this row) and DB_ROLL_PTR (7 bytes, a pointer into the undo
log). When a row is updated, the old version is written to undo with its previous TRX_ID
and ROLL_PTR; the row in the data page is overwritten with the new version's TRX_ID and a
ROLL_PTR pointing to the old undo entry.
SELECT under READ VIEW (snapshot):
read row from data page → check its DB_TRX_ID
if DB_TRX_ID is visible to my snapshot → use this version
else → follow DB_ROLL_PTR into undo log
retrieve previous version, check its TRX_ID
repeat until visible (or end of chain)

Compare to PostgreSQL: PG keeps multiple versions in the heap (each row has xmin/xmax) and VACUUM removes obsolete ones. InnoDB keeps only the latest in the heap; older versions live in undo. PG's vacuum lag = bloat in the heap; InnoDB's undo retention = bloat in undo. Both have the long-running-transaction problem (old undo / old versions can't be cleaned).
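A hedged two-session sketch of the snapshot mechanics (session labels and values are illustrative):

-- session A
START TRANSACTION WITH CONSISTENT SNAPSHOT;
SELECT name FROM users WHERE id = 42;    -- reads the current version

-- session B
UPDATE users SET name = 'renamed' WHERE id = 42;
COMMIT;                                  -- the old version now lives only in undo

-- session A, same transaction
SELECT name FROM users WHERE id = 42;
-- still the old value: InnoDB follows DB_ROLL_PTR into undo to rebuild it
COMMIT;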
Redo Log (the WAL)
A fixed-capacity circular log; group commit makes commits cheap.
InnoDB's redo log functions as the WAL. Two files (ib_logfile0,
ib_logfile1) through 8.0.29; from 8.0.30, a set of 32 #ib_redo files whose total size is
innodb_redo_log_capacity (default 100 MB; write-heavy systems raise it to several GB).
Logs are organized as mini-transactions (mtr) — atomic groups of low-level
page changes (e.g. one row insert plus the corresponding secondary index updates).
backend writes mtr log records to redo buffer (RAM)
    ↓
on commit: backend appends commit marker, flushes redo buffer to log file
    ↓
if innodb_flush_log_at_trx_commit=1: fsync now; otherwise: defer
    ↓
return success to client
Group commit batches multiple commits' fsyncs together. Concurrent commits arriving within the same window all share one fsync. Combined with binlog group commit, MySQL can serve tens of thousands of commits per second on a single host with full durability.
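The durability/latency trade-off hangs on two variables. A hedged sketch of the two common configurations (both variables are dynamic):

-- full durability: redo and binlog fsynced at every commit
-- (group commit amortizes these fsyncs across concurrent transactions)
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
SET GLOBAL sync_binlog = 1;

-- relaxed: redo written to the OS page cache at commit, fsynced ~once/second;
-- an OS crash or power loss can drop up to ~1s of commits
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
SET GLOBAL sync_binlog = 0;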
Doublewrite Buffer
Defense against torn writes.
A 16 KB page write to disk is not atomic on most filesystems — power loss in the middle
can leave the page half-written, corrupting it. InnoDB's defense: write each modified page
first to the doublewrite buffer (a 2 MB sequential area in the system tablespace),
fsync, then write to the page's actual location, fsync. On
crash recovery, pages with bad checksums in the data file are restored from the
doublewrite buffer.
Cost: every page write happens twice. On a write-heavy workload, this is significant. Modern alternatives:
- innodb_doublewrite=OFF: skip doublewrite. Safe ONLY if the storage guarantees atomic 16 KB writes (some enterprise NVMe with O_DIRECT and matching sector size; ZFS with recordsize=16K).
- atomic_writes (Percona/MariaDB extension): tell the kernel to use filesystem-level atomic writes. Same effect, no doublewrite.
For most cloud disks, leave doublewrite on — the safety is worth the throughput cost.
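To check the setting and measure the overhead on a running server (counter names as exposed by stock MySQL):

-- is doublewrite active?
SHOW GLOBAL VARIABLES LIKE 'innodb_doublewrite';
-- how much extra writing is it doing?
SHOW GLOBAL STATUS LIKE 'Innodb_dblwr%';   -- pages written, write batches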
Change Buffer
Deferring secondary index updates when pages aren't in memory.
Updating a non-unique secondary index when its page isn't in the buffer pool would require reading the page (random I/O), modifying it, writing it back. The change buffer instead stages the change in a special area of the buffer pool: a record describing what to do. When the page is later read for any reason (a query, a checkpoint flush), the change buffer entries are merged into it.
UPDATE users SET email = 'new@x' WHERE id = 42;
→ primary btree page (in buffer pool) → modify directly
→ secondary index on email (page NOT in buffer pool):
INSERT change buffer record for "delete old@x, insert new@x"
    continue without I/O

The win: write-heavy workloads with large secondary indexes that don't fit in RAM avoid tons of random reads. The cost: change buffer records consume buffer pool space; reads that have to wait for a change buffer merge are slower; and it doesn't apply to unique indexes (which need an immediate uniqueness check).
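Change buffering can be scoped or switched off at runtime. A hedged sketch; the accepted values for innodb_change_buffering are none, inserts, deletes, changes, purges, and all:

-- buffer only insert operations, or disable change buffering entirely
SET GLOBAL innodb_change_buffering = 'inserts';
SET GLOBAL innodb_change_buffering = 'none';

-- merge activity appears under "INSERT BUFFER AND ADAPTIVE HASH INDEX"
SHOW ENGINE INNODB STATUS\G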
Adaptive Hash Index
An automatic in-memory hash for hot keys.
InnoDB monitors which b+tree paths are being walked frequently. For sufficiently hot keys, it builds an in-memory hash index that maps key → page-and-record-offset, skipping the b+tree descent. Lookups become O(1) for the hot set.
Tunable: innodb_adaptive_hash_index=ON (default). Disable it on workloads with
uniformly random access (no hot keys to cache). Historically the hash was protected by a
single latch, which could become a contention bottleneck on many cores; modern versions
partition it (innodb_adaptive_hash_index_parts, default 8).
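The on/off switch is dynamic; the partition count is startup-only. A quick sketch:

-- disable the AHI if monitoring shows contention on the btr_search latches
SET GLOBAL innodb_adaptive_hash_index = OFF;
-- partition count is read-only at runtime (set it in my.cnf)
SELECT @@innodb_adaptive_hash_index_parts;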
Locking, Gap Locks, and REPEATABLE READ
Why MySQL's default isolation feels stricter than PostgreSQL's.
MySQL defaults to REPEATABLE READ; PostgreSQL defaults to READ COMMITTED. The visible difference: in REPEATABLE READ, a transaction's reads return the same data on repeat execution within the txn. In READ COMMITTED, every statement sees a fresh snapshot.
InnoDB's REPEATABLE READ adds gap locks on range queries with
FOR UPDATE or FOR SHARE:
-- Suppose users has rows with id 10, 20, 30
SELECT * FROM users WHERE id BETWEEN 15 AND 25 FOR UPDATE;
-- locks the gaps (10,20) and (20,30), plus the record at id=20
-- another txn cannot INSERT a row with id 16 while this txn holds locks
-- prevents phantom reads on re-execution of this query
Cost: gap locks contend more than record locks. Workloads with high INSERT rates against
hot ranges deadlock more often. Many MySQL DBAs run READ COMMITTED (which requires the
ROW binlog format) for higher throughput, at the cost of phantom risk in app code.
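Switching levels is straightforward; a hedged sketch of the common pairing (READ COMMITTED with row-based binlogging — statement-based binlogging requires at least REPEATABLE READ):

SET SESSION transaction_isolation = 'READ-COMMITTED';  -- named tx_isolation before 8.0
SET GLOBAL  binlog_format = 'ROW';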
Binlog and Replication
A separate log driving everything beyond the local server.
The binary log (binlog) is a server-level log of every change, separate from InnoDB's redo log. Three formats:
STATEMENT — logs SQL statements; replicas re-execute them. Compact but unsafe for non-deterministic SQL (NOW(), UUID()).
ROW — logs row-level changes (before/after images). Larger but deterministic; the modern default.
MIXED — STATEMENT for safe queries, ROW for unsafe. Compromise; rarely used in production.

Replication: a replica connects, reads the master's binlog, applies the events. Modes:
async — master commits independently of replicas (default)
semi-sync — master waits for at least one replica to ACK before commit
sync (group replication) — master waits for a quorum of replicas before commit
GTID-based — Global Transaction IDs (server-uuid:seq) instead of
file/offset; cleaner failover
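Enabling GTID mode on a live server must step through the permissive modes in order; a hedged sketch of the usual sequence:

SET GLOBAL enforce_gtid_consistency = ON;
SET GLOBAL gtid_mode = OFF_PERMISSIVE;
SET GLOBAL gtid_mode = ON_PERMISSIVE;
-- wait for anonymous (non-GTID) transactions to drain, then:
SET GLOBAL gtid_mode = ON;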
Parallel replication (since 5.7): replicas apply non-conflicting transactions in
parallel, eliminating the single-threaded SQL-apply bottleneck of older MySQL.
slave_parallel_workers sets the worker count;
slave_parallel_type=LOGICAL_CLOCK picks the modern dependency-aware scheduler
(renamed replica_parallel_workers and replica_parallel_type in 8.0.26).
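A hedged configuration sketch using the 8.0.26+ variable names (older servers use the slave_* spellings and STOP SLAVE syntax):

STOP REPLICA SQL_THREAD;
SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL replica_parallel_workers = 8;         -- size to replica CPU headroom
SET GLOBAL replica_preserve_commit_order = ON;   -- commit in source order
START REPLICA SQL_THREAD;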
MyISAM (Vestigial)
The original MySQL storage engine; rarely used now.
MyISAM was the default storage engine until 5.5 (2010). Three files per table:
.MYD (data), .MYI (index), .frm (schema). Fast for
read-only workloads (no MVCC overhead, simple b-tree indexes). No transactions, no row
locking (only table-level), no crash safety.
Reasons it's still around:
- Some legacy databases never converted (running fine, why touch?)
- system tables in the mysql schema were MyISAM until 8.0 (now InnoDB)
- FULLTEXT search was MyISAM-only until InnoDB got it in 5.6
- some niche tooling assumed the MyISAM file layout

For new tables in 2024+: never use MyISAM. Use InnoDB.
MySQL vs PostgreSQL
| Aspect | MySQL/InnoDB | PostgreSQL |
|---|---|---|
| Default engine | InnoDB | (only one — heap) |
| Index structure | clustered (PK = table) | heap + indexes point to TID |
| MVCC storage | undo log | multi-version heap + VACUUM |
| Default isolation | REPEATABLE READ | READ COMMITTED |
| Range lock model | gap + next-key locks | predicate locks (Serializable only) |
| WAL | redo log + binlog | single WAL |
| Replication | binlog (statement/row) | WAL streaming + logical decoding |
| Logical CDC | binlog with row format | logical decoding via slots |
| Read replicas | async or semi-sync | async, sync via synchronous_standby_names |
| Window functions | since 8.0 | since 8.4 |
| JSON support | JSON type, generated cols | JSONB with GIN index |
| License | GPL (Oracle) | PostgreSQL (BSD-like) |
Tuning Cheat Sheet
| Variable | Recommended | Note |
|---|---|---|
| innodb_buffer_pool_size | 70-80% RAM | most important knob |
| innodb_flush_log_at_trx_commit | 1 (safe) or 2 | 2 loses ≤1s on OS crash, much faster |
| sync_binlog | 1 | fsync binlog per commit (with group commit it's cheap) |
| innodb_redo_log_capacity | ~5-25 GB | bigger = fewer checkpoint stalls |
| innodb_io_capacity | match disk IOPS | SSD: 2000-20000 |
| innodb_io_capacity_max | 2× io_capacity | burst limit during catch-up |
| innodb_flush_method | O_DIRECT | avoid double-buffering with OS |
| innodb_doublewrite | ON (default) | OFF only with verified atomic writes |
| innodb_file_per_table | ON (default 5.6+) | per-table .ibd files |
| binlog_format | ROW | safe, deterministic |
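Several of these knobs are dynamic and can be applied without a restart; a hedged sketch (sizes are illustrative — innodb_flush_method is static and belongs in my.cnf):

-- buffer pool resizes online, in chunks, since 5.7
SET GLOBAL innodb_buffer_pool_size = 8 * 1024 * 1024 * 1024;
-- redo capacity is dynamic from 8.0.30
SET GLOBAL innodb_redo_log_capacity = 8 * 1024 * 1024 * 1024;
SET GLOBAL innodb_io_capacity = 4000;
SET GLOBAL innodb_io_capacity_max = 8000;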
FAQ
Why does InnoDB store rows in primary-key order?
InnoDB uses a clustered index design: the primary key IS the table's b-tree, with row data living in the leaves. Looking up a row by primary key is one b-tree descent — no extra fetch. The cost: secondary indexes can't point to disk addresses (those would change on row movement); they store the primary key as the row pointer, requiring a second lookup into the primary tree on every secondary-index hit. PostgreSQL takes the opposite approach (heap files + indexes that point to physical row locations) — fewer lookups via secondary indexes, but no automatic ordering.
What is the change buffer for?
Defers updates to non-unique secondary indexes when their pages aren't in the buffer pool. Instead of reading the page, modifying it, writing it back — InnoDB writes a 'change buffer record' describing the modification. When the page is later read for any reason, the change buffer entries are merged in. For write-heavy workloads with large secondary indexes that don't fit in RAM, this dramatically reduces random I/O. Disabled for unique indexes (they need to validate uniqueness immediately). Visible in SHOW ENGINE INNODB STATUS as 'INSERT BUFFER AND ADAPTIVE HASH INDEX'.
What is the doublewrite buffer?
A 2 MB sequential file (or area in the system tablespace) where InnoDB writes pages BEFORE writing them to their final location in the data files. Why: the OS may write only part of a 16 KB page to disk on a power loss (a 'torn write'), corrupting the page. The doublewrite gives InnoDB a known-good copy to recover from on the next startup. Performance cost: every page write happens twice. Modern SSDs with atomic 16 KB writes (e.g. via O_DIRECT and matching device sector size) make doublewrite redundant; innodb_doublewrite=OFF is safe on such hardware.
Why is REPEATABLE READ the default isolation level?
Historical and pragmatic. InnoDB's REPEATABLE READ is stronger than the SQL-standard level: gap and next-key locks prevent phantoms on locking reads, though it is not fully serializable (write skew is still possible). PostgreSQL defaults to READ COMMITTED, which is faster but allows phantom reads. MySQL chose stronger default safety; the cost is more locking on range queries. For high-throughput OLTP, many MySQL installs run with READ COMMITTED instead, accepting some phantoms in exchange for less lock contention.
What's a gap lock?
A lock on the range BETWEEN two existing index keys, not on a specific key. SELECT ... FOR UPDATE WHERE x BETWEEN 10 AND 20 takes gap locks at every gap between matching rows, plus next-key locks (gap + record) at the boundary rows. This prevents another transaction from inserting new rows in the range — preserving the result set if you re-execute. Gap locks are why MySQL's REPEATABLE READ avoids phantoms but can deadlock more aggressively. Disabled in READ COMMITTED isolation.
What is group commit in MySQL?
Multiple concurrent COMMITs are batched into a single fsync of the binlog (and a single fsync of the redo log). Without group commit, 1000 concurrent commits do 1000 fsyncs; with it, roughly one. MySQL's binlog group commit landed in 5.6, building on group-commit work pioneered in MariaDB. innodb_flush_log_at_trx_commit and sync_binlog control durability — both at 1 means a full sync per commit (safe, slow); 0 or 2 trade durability for speed.
How does InnoDB do MVCC?
Every InnoDB row carries a 6-byte DB_TRX_ID (last txn that modified it) and 7-byte DB_ROLL_PTR (pointer into the undo log). When a row is updated, the old version is written to the undo log; the row in the data page is overwritten with the new version. Readers at older snapshots follow the DB_ROLL_PTR backward through undo entries to reconstruct the version they should see. Compare to PostgreSQL: PG keeps multiple row versions in the heap and VACUUM cleans them up. MySQL keeps only the latest in the heap; older versions are reconstructed on demand from undo.