Aurora DSQL Storage: How S3 and NVMe Split the Stack

Why the journal-on-durable-storage / NVMe-as-cache split is the architectural bet, not just a buzzword

By the Systems Explained editors · Reviewed for technical accuracy against Marc Brooker's DSQL vignettes and the AWS Aurora DSQL documentation · Published 2026-05-23

Most coverage of Amazon Aurora DSQL stops at the marketing line: "serverless, distributed, multi-region active-active, PostgreSQL-compatible." Read the AWS user guide and you get a list of components: relay, compute, transaction log, storage. That list hides the actual architectural decision. The decision is this: DSQL splits its storage stack into a journal that is the source of truth and a set of NVMe-backed storage replicas that are a derivable view. The journal is held in highly durable, replicated AWS storage (the same shape of guarantee S3 offers — 11 nines, cross-AZ); the NVMe replicas cache the recent version chain so the query processor can serve point reads in single-digit milliseconds. That one decision — not "disaggregated storage" in general — is what gives DSQL its no-failover-latency, no-quorum-replication, multi-region active-active behavior. And it has a cost. This post walks through how the split actually works, why it does the things its competitors can't, and what query shapes pay for it in tail latency.

The two-tier storage stack

Three layers, but only one of them owns truth.

In Marc Brooker's "DSQL Vignette: Transactions and Durability", the architecture is described in deliberately functional terms: there is a journal, an "internal component we've been building at AWS for nearly a decade, optimized for ordered data replication across hosts, AZs, and regions"; there is a query processor (compute) that holds no durable state; and there are storage replicas that consume the journal and serve reads. Pin that picture in your head and most of the architecture follows. The journal accepts batches of committed transactions in commit-timestamp order. Storage replicas subscribe to the journal and apply the changes, producing the materialized version chains that point reads and short scans hit. Compute is stateless: any session can be served by any query processor, and the query processor's only durable interaction is to ask storage for the version visible at a given timestamp.
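The three roles described above can be sketched as a toy model. This is a minimal illustration under stated assumptions, not DSQL's implementation: the class names `Journal` and `StorageReplica`, the integer timestamps, and the in-memory lists are all hypothetical stand-ins.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Journal:
    """Hypothetical ordered, append-only log of committed batches."""
    entries: list = field(default_factory=list)  # [(commit_ts, {key: value})]

    def append(self, commit_ts, writes):
        # Batches are only accepted in strictly increasing commit-timestamp order.
        if self.entries and commit_ts <= self.entries[-1][0]:
            raise ValueError("batches must arrive in commit-timestamp order")
        self.entries.append((commit_ts, dict(writes)))


@dataclass
class StorageReplica:
    """Hypothetical follower that materializes per-key version chains."""
    applied: int = 0                             # journal entries applied so far
    chains: dict = field(default_factory=dict)   # key -> [(commit_ts, value)]

    def catch_up(self, journal):
        # Apply every batch not yet seen, in journal order.
        for commit_ts, writes in journal.entries[self.applied:]:
            for key, value in writes.items():
                self.chains.setdefault(key, []).append((commit_ts, value))
        self.applied = len(journal.entries)

    def read(self, key, snapshot_ts):
        """Return the newest version with commit_ts <= snapshot_ts, or None."""
        chain = self.chains.get(key, [])
        timestamps = [ts for ts, _ in chain]
        idx = bisect_right(timestamps, snapshot_ts)
        return chain[idx - 1][1] if idx else None


journal = Journal()
journal.append(10, {"acct:1": 100})
journal.append(20, {"acct:1": 80, "acct:2": 20})

replica = StorageReplica()
replica.catch_up(journal)
print(replica.read("acct:1", 15))  # 100: the version committed at ts=10
```

Compute does not appear as a class here because, in this model, it holds nothing: a query processor is just whatever code calls `replica.read(key, snapshot_ts)`.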

The S3/NVMe framing is the operational shape of those abstractions. The journal's durability target is S3-class: committed transactions are replicated across at least three Availability Zones, with the durability guarantee customers expect from AWS's most durable primitives. AWS has also been clear in public talks (the re:Invent 2024 DAT427 deep-dive in particular) that storage replicas run on NVMe-backed hosts, so reads against the hot working set are served at local-SSD speed. The durable copy is the slow, cheap, geo-replicated one. The fast copy is the in-region, NVMe-resident, derivable one. If you only remember one thing about DSQL storage, remember that order: journal first, replicas second.

This is a different shape from Aurora classic (Verbitski et al., SIGMOD 2017), where the "log is the database" and the storage layer is six replicas across three AZs in one region, all of them online and quorum-written on the fast path. Aurora classic does not have a cold tier; every byte of every page lives on hot storage. DSQL has a cold tier by design — and that cold tier is what lets compute and storage replicas be torn down and recreated cheaply.

Which queries hit which tier

A point read against last-second data and a scan across 30 days of history are not the same query, even if their SQL looks identical.

A read in DSQL is parameterized by a snapshot timestamp. The query processor sends the read down to a storage replica with that timestamp, and storage answers with the version of each row visible at that point. Whether the answer is fast or slow depends entirely on what's still materialized on the NVMe-resident replica. Recent rows, recent version chains, and the working set of hot partitions sit on NVMe; the storage node serves them with the latency profile of a local SSD read plus network. Public talks and the Brooker "Reads and Compute" vignette put that path in the single-digit-millisecond range for the common case.

A read that crosses the NVMe retention horizon — historical data, a long scan over an old partition, a time-travel query against a timestamp that is no longer materialized — has to reconstruct rows from the journal. That path is fundamentally different. It involves reading older journal segments from durable storage, applying the version-chain reconstruction the storage replica would normally have cached, and returning the result. The latency budget for that path is dominated by the round-trip to S3-class storage and journal-segment scan time, which sits in the tens-to-low-hundreds-of-milliseconds range for first-byte latency under typical conditions. Two different reads, two different latency distributions, the same SQL.

This is the bimodal behavior the marketing diagrams don't show. Most production workloads on DSQL will sit overwhelmingly on the NVMe side: OLTP point lookups, idempotent writes, short transactions over a small working set. But query shapes that look equally innocent — a SELECT with a wide time range, a backfill scan, an analytics-style aggregation over an old partition — sit on the other side of the crossover and pay the journal-reconstruction tax on every row. DSQL is not an OLAP engine and is not designed to be one; this is the architectural reason why.
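A little arithmetic shows why a small cold fraction matters so much. The sketch below treats latency as a two-point mixture; the 5 ms and 120 ms values are illustrative assumptions picked from inside the ranges quoted above (single-digit milliseconds, tens to low hundreds of milliseconds), and the function names are hypothetical.

```python
def mean_latency(cold_fraction, hot_ms=5.0, cold_ms=120.0):
    """Average latency when cold_fraction of reads miss the hot tier."""
    return (1.0 - cold_fraction) * hot_ms + cold_fraction * cold_ms


def mixture_percentile(cold_fraction, hot_ms=5.0, cold_ms=120.0, pct=99.0):
    """Percentile latency for a mixture of two fixed latencies.

    With only two possible values, the pct-th percentile is the hot latency
    as long as at least pct percent of reads are hot, else the cold latency.
    """
    hot_percent = (1.0 - cold_fraction) * 100.0
    return hot_ms if hot_percent >= pct else cold_ms


for cold in (0.005, 0.02):
    print(cold, round(mean_latency(cold), 3), mixture_percentile(cold))
```

Moving from 0.5% to 2% cold reads shifts the mean by under 2 ms in this model, but p99 jumps from the hot value to the cold value. That is the sense in which the profile is bimodal rather than merely slower.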

Interactive: which tier serves this query?

[Interactive figure. Two sliders: query age (time between commit and read; default 10 sec) and NVMe retention window (default 1 hour). Three panels: Compute (stateless query processor, never holds durable state); NVMe replicas (cache, recent rows materialized, ~5 ms reads); Journal (durable, S3-class, cold reads reconstructed here, ~120 ms first byte). At the defaults, a 10-second-old read is served by the NVMe-resident replica at an estimated 5 ms.]
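The rule behind the figure can be written as a small function. This is a sketch of the article's simplified model only: the name `serving_tier`, the hard cutoff at the retention window, and the fixed 5 ms and 120 ms estimates (taken from the figure's labels) are assumptions, not documented DSQL behavior.

```python
def serving_tier(query_age_s, retention_s, nvme_ms=5.0, journal_ms=120.0):
    """Return (tier, estimated_latency_ms) for a read of the given age.

    query_age_s: seconds between the commit and the read.
    retention_s: assumed NVMe retention window in seconds.
    """
    if query_age_s <= retention_s:
        return ("nvme", nvme_ms)
    return ("journal", journal_ms)


print(serving_tier(10, 3600))         # ('nvme', 5.0)
print(serving_tier(2 * 3600, 3600))   # ('journal', 120.0)
```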

The rebalance protocol: how a row gets demoted to journal-only

If the journal is the source of truth and NVMe is a cache, eviction has to be safe by construction.

Walk a single row through its lifecycle. A client calls INSERT. The query processor accumulates the write locally, with no replication and no durability, and at COMMIT time hands the write batch to the journal subsystem with a commit timestamp derived from the Amazon Time Sync Service (the clock-synchronization service that plays roughly the role for DSQL that TrueTime plays for Spanner). The journal appends the batch and, only after it has confirmed the batch is durably stored across AZs, returns success to the query processor, which returns success to the client. At this point the row is committed and durable, but it may not yet have been applied by any NVMe replica. The journal is the truth; everything else catches up.
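The ordering of that commit path (buffer locally, append to the journal, acknowledge only after durability is confirmed) can be sketched as follows. The classes `DurableJournal` and `QueryProcessorSession` are hypothetical, and "durable" is simplified here to "written to every one of three in-memory per-AZ lists"; the real journal's replication protocol is not public in this detail.

```python
class DurableJournal:
    """Toy journal: a batch counts as durable once every AZ copy has it."""

    def __init__(self, az_count=3):
        self.az_logs = [[] for _ in range(az_count)]

    def append(self, commit_ts, writes):
        for log in self.az_logs:
            log.append((commit_ts, dict(writes)))
        # Acknowledge only after the loop above has written every copy.
        return True


class QueryProcessorSession:
    """Stand-in for stateless compute: buffers writes, holds nothing durable."""

    def __init__(self, journal):
        self.journal = journal
        self.pending = {}

    def insert(self, key, value):
        self.pending[key] = value  # local only: lost if this process dies

    def commit(self, commit_ts):
        if not self.journal.append(commit_ts, self.pending):
            raise RuntimeError("journal did not confirm durability")
        self.pending = {}
        return "COMMIT"  # success reaches the client only after the journal ack


journal = DurableJournal()
session = QueryProcessorSession(journal)
session.insert("order:42", "paid")
print(session.commit(commit_ts=100))
```

Note that no storage replica appears anywhere in this path, which is the point: in this model, durability is decided before any replica has seen the write.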

Storage replicas subscribe to the journal stream. They consume committed batches in commit-timestamp order and apply them to their local materialized state on NVMe — producing the new version of each affected row alongside the existing version chain (MVCC). Because the journal hands the storage replicas a totally ordered stream of committed transactions, the replicas do not need 2PC, do not need a consensus quorum among themselves, and do not need leader election. They are pure followers of a totally ordered log. This is the structural reason DSQL avoids the leader-election stalls that Spanner's Paxos groups and CockroachDB's per-range Raft groups run into under partial failure.

Eviction works in the other direction. Once a row's age exceeds the NVMe retention threshold for its partition, and the storage replica has confirmed (via the journal's durability ack) that the canonical copy is safely in the journal, the NVMe-resident materialization can be dropped. There is no special compaction step as you would see in an LSM like RocksDB; the NVMe layout is a derivable view, and "evict" just means "stop caching, fall back to journal reconstruction on the next read." If a storage replica dies entirely, a new one is born by replaying journal segments from a snapshot — there is no streaming-from-a-peer recovery protocol like Cockroach's range-snapshot transfer, because the journal already holds an authoritative log.
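Eviction as "stop caching, fall back to reconstruction" can be sketched like this. Everything here is an illustrative assumption: the `CachingReplica` class, whole-key eviction by newest-version age, and a plain Python list standing in for the journal.

```python
def replay(journal, upto_ts=None):
    """Rebuild version chains by replaying an ordered journal."""
    chains = {}
    for commit_ts, writes in journal:
        if upto_ts is not None and commit_ts > upto_ts:
            break  # journal is ordered, so nothing later is visible
        for key, value in writes.items():
            chains.setdefault(key, []).append((commit_ts, value))
    return chains


class CachingReplica:
    def __init__(self, journal):
        self.journal = journal        # source of truth: [(commit_ts, writes)]
        self.cache = replay(journal)  # derivable, NVMe-style materialized view

    def evict_older_than(self, horizon_ts):
        """Drop keys whose newest version is older than the horizon."""
        stale = [k for k, chain in self.cache.items() if chain[-1][0] < horizon_ts]
        for key in stale:
            del self.cache[key]

    def read(self, key, snapshot_ts):
        """Return (value, tier); a cache miss is rebuilt from the journal."""
        chain, tier = self.cache.get(key), "nvme"
        if chain is None:
            chain, tier = replay(self.journal, snapshot_ts).get(key, []), "journal"
        visible = [value for ts, value in chain if ts <= snapshot_ts]
        return (visible[-1] if visible else None, tier)


journal = [(10, {"a": 1}), (20, {"b": 2}), (30, {"b": 3})]
replica = CachingReplica(journal)
replica.evict_older_than(15)
print(replica.read("a", 25))  # (1, 'journal'): evicted, so reconstructed
print(replica.read("b", 25))  # (2, 'nvme'): still materialized
```

Replacing a dead replica is the same operation as construction: `CachingReplica(journal)` replays the log and needs no peer to copy from.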

That asymmetry is the operational payoff. In Cockroach, when a node dies the surviving replicas have to bring the replacement node up to date by streaming state to it. In Spanner, a tablet whose Paxos group loses a member has to re-elect and stream. In Aurora classic, the storage fleet keeps six in-region copies behind a quorum write protocol that was never designed to stretch across regions. In DSQL, any storage replica is cheap to throw away and cheap to rebuild from the journal. That is the single sentence that explains the operational consequences of the split.

Consistency across the split

A read at timestamp T must see the same version regardless of which tier serves it.

The risk in any tiered storage design is that the tiers fall out of sync and the user sees a stale read from the wrong one. DSQL avoids that by making the journal the only thing that defines what is visible. Every transaction commits at a TimeSync-issued timestamp; the journal accepts batches in that timestamp order; storage replicas apply them in that timestamp order. A read at timestamp T is, by construction, the result of applying every committed transaction up to T — regardless of whether the storage replica has already materialized those changes on NVMe or has to reconstruct them from the journal. If the replica has the version visible at T cached, it returns it. If not, it reconstructs it. Either way, the answer is the same.
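The "either way, the answer is the same" claim can be checked mechanically in a toy model. The sketch below compares a read from a materialized version chain against a read reconstructed directly from the journal, for every key and timestamp in a small example; the function names and the list-based journal are hypothetical.

```python
def materialize(journal):
    """Build key -> [(commit_ts, value)] chains, as a replica would cache them."""
    chains = {}
    for commit_ts, writes in journal:
        for key, value in writes.items():
            chains.setdefault(key, []).append((commit_ts, value))
    return chains


def read_cached(chains, key, t):
    """Read at timestamp t from the materialized version chain."""
    visible = [value for ts, value in chains.get(key, []) if ts <= t]
    return visible[-1] if visible else None


def read_reconstructed(journal, key, t):
    """Read at timestamp t by scanning the ordered journal up to t."""
    result = None
    for commit_ts, writes in journal:
        if commit_ts > t:
            break
        if key in writes:
            result = writes[key]
    return result


def tiers_agree(journal, keys, timestamps):
    chains = materialize(journal)
    return all(
        read_cached(chains, k, t) == read_reconstructed(journal, k, t)
        for k in keys
        for t in timestamps
    )


journal = [(10, {"a": 1}), (20, {"a": 2, "b": 7}), (30, {"b": 8})]
print(tiers_agree(journal, ["a", "b", "missing"], range(0, 41)))  # True
```

Both read paths are pure functions of the journal prefix up to `t`, which is why they cannot disagree in this model.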

Snapshot isolation falls out of this naturally. Each transaction picks a start timestamp; reads inside the transaction return the row versions visible at that start; the optimistic concurrency control adjudicator at commit time checks for write-write conflicts between the start and commit timestamps and aborts if there is one. The AWS documentation confirms strong consistency and snapshot isolation as the contract, with the same isolation level guaranteed across AZs and across peered regions. The MVCC version chain is not a side effect — it is the data structure that makes the consistency story work, because it lets the storage layer answer "what did this row look like at timestamp T" without coordination.
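The commit-time check described above can be sketched as a single-process toy. The `Adjudicator` class below is a hypothetical simplification: it tracks only the latest commit timestamp per key and rejects a transaction if any key it writes was committed by someone else after the transaction's start timestamp.

```python
class Adjudicator:
    """Toy optimistic-concurrency check for write-write conflicts."""

    def __init__(self):
        self.last_commit = {}  # key -> commit_ts of the latest committed write

    def try_commit(self, start_ts, commit_ts, write_keys):
        # Conflict: another transaction committed a write to one of our keys
        # after we took our snapshot at start_ts.
        for key in write_keys:
            last = self.last_commit.get(key)
            if last is not None and last > start_ts:
                return False  # abort; the caller is expected to retry
        for key in write_keys:
            self.last_commit[key] = commit_ts
        return True


adj = Adjudicator()
print(adj.try_commit(start_ts=10, commit_ts=20, write_keys={"a"}))  # True
print(adj.try_commit(start_ts=15, commit_ts=25, write_keys={"a"}))  # False
print(adj.try_commit(start_ts=15, commit_ts=26, write_keys={"b"}))  # True
```

The second transaction aborts because its snapshot (ts=15) predates the first transaction's commit to the same key (ts=20). The third touches a different key and goes through. Reads are never checked, which is what distinguishes snapshot isolation from serializability.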

Important corollary: this is not eventual consistency despite the tiered architecture. The user-visible value at timestamp T is fully determined by the journal up to T. The NVMe materialization state is internal bookkeeping. A user-visible read is allowed to be slower when the materialization is incomplete, but it is never allowed to be wrong. The journal-is-truth invariant is the load-bearing rule.

How Aurora classic, Spanner, and CockroachDB do it differently

Four databases, four storage philosophies. The shape of the storage layer is the shape of the database.

Property	Aurora classic	DSQL	Spanner	CockroachDB
Where canonical state lives	6-way quorum across 3 AZs in one region (shared storage)	Journal in S3-class durable storage, replicated across AZs/regions	Tablets on Colossus (Google's distributed FS)	Per-range Pebble (LSM) on local disk on each replica
Cold tier?	No — all data hot	Yes — journal is the cold tier; NVMe replicas are derivable cache	No — Colossus is the only tier	No — all state on local disk on the replicas
Recovery primitive	Repair a damaged segment from quorum peers	Replay journal from durable storage into a new replica	Paxos re-election + tablet move	Raft re-election + range snapshot transfer
Multi-region active-active	No — single-region only	Yes — peered clusters in a region set	Yes — global Paxos groups	Optional — multi-region replicas, but coordination cost grows
Failure-recovery cost	Low for single-AZ loss; not designed for region loss	Cheap — any storage replica is disposable, reborn from journal	Re-election stall, then catch-up	Raft stall, then state-streaming catch-up
Cold-read tail latency	Uniform — all data on hot storage	Bimodal — NVMe fast, journal-reconstruction slow	Uniform — Colossus latency for everything	Uniform — local-disk latency for everything

The bet DSQL is making is that the bimodal latency profile is worth it because the cold side is also the only side that has to be durable across regions. Aurora classic's quorum is fast because every copy is local; the same design simply doesn't extend to a six-way quorum across continents, because the speed of light is in the way. Spanner's TrueTime-coordinated Paxos groups give you global writes but pay a coordination cost on every write. CockroachDB's per-range Raft groups give you flexible deployment but force a node-replacement protocol that has to stream state from a healthy peer, and a range that loses a majority of its replicas becomes unavailable until quorum is restored. DSQL sidesteps all of that by saying: the durable copy lives in storage that is already replicated geo-redundantly, and the fast copies are cheap, in-region, and disposable.

This is not free. Each of the other three has design wins DSQL gives up. Aurora classic has uniform low latency and a mature PostgreSQL feature surface DSQL is still catching up to. Spanner has externally consistent global transactions backed by TrueTime. CockroachDB lets you choose your replica placement explicitly and run on your own hardware. DSQL traded those for elasticity and cheap recovery. Whether that is the right trade depends on what your workload actually does.

Operational consequences readers can use

When to pick DSQL, when to actively avoid it.

Pick DSQL when your workload looks like: OLTP, mostly point reads and short transactions, working set that fits in the hot tier, demand for active-active multi-region without writing a custom failover orchestrator, and an operations team that does not want to think about quorum sizing, replica placement, or version upgrades. The recovery story alone — any storage node is cheap to recreate, any region is independent — is worth the price of admission for many teams that previously had to staff a 24/7 database operations rotation.

Avoid DSQL when your workload looks like: long scans over historical data, analytical queries, workloads where p99 latency on cold data is part of the SLA, or sustained per-row throughput against partitions older than the hot window. The bimodal latency profile will surface as customer-visible jank in those cases, and there is no per-table cache-pinning knob you can turn to fix it. If you find yourself wanting that knob, you are running the wrong workload on DSQL — feed the data into Redshift, Athena, ClickHouse, or DuckDB via change data capture and let the OLAP engine do what it is good at.
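One way to act on that advice in application code is to route queries by how far back they reach. The sketch below is a hypothetical helper, not an AWS API: `route_query` and its `hot_window` parameter are assumptions, and because AWS does not publish the retention window, the one-hour default is only a placeholder to be replaced by whatever horizon your own latency measurements suggest.

```python
from datetime import datetime, timedelta, timezone


def route_query(range_start, now, hot_window=timedelta(hours=1)):
    """Pick a backend for a time-range query.

    Returns "dsql" when the whole range falls inside the assumed hot window,
    and "olap" when it reaches further back and should go to the analytics
    store that is fed by change data capture.
    """
    return "dsql" if now - range_start <= hot_window else "olap"


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
print(route_query(now - timedelta(minutes=5), now))  # dsql
print(route_query(now - timedelta(days=30), now))    # olap
```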

Takeaways

The journal is truth. NVMe is cache. Compute is stateless. Memorize those three sentences and almost every DSQL architectural decision falls out of them: why failover is free, why any storage node is disposable, why long historical scans are slow, why there is no quorum-write protocol on the fast path.
DSQL trades cold-read tail latency for elastic recovery and multi-region. The split is the bet. Single-tier designs that keep all data hot (CockroachDB, Aurora classic) win on uniform latency and lose on the recovery and geo-distribution stories DSQL was built for.
Aurora classic and CockroachDB cannot borrow the trick without giving up their own design wins. Aurora classic's quorum-on-fast-storage is a different bet for a different workload; CockroachDB's local-disk Raft is a different bet for a different deployment model. DSQL only makes sense given AWS's specific stack — TimeSync, the journal-as-an-internal-service, and S3-class durable storage as a primitive.

FAQ

Does Aurora DSQL replicate writes to S3 synchronously before the COMMIT returns?

Yes, with one caveat about wording: the synchronous write goes to the journal, and "S3" in this post is shorthand for S3-class durability, not a claim that the journal is literally an S3 bucket. A COMMIT is acknowledged to the client only after the write batch has been appended to the journal and the journal has confirmed durability across multiple Availability Zones. The journal is the source of truth in DSQL, and AWS positions it, not the materialized NVMe-resident replicas, as the gating durability primitive. The NVMe-backed storage replicas catch up asynchronously after the journal accepts the batch, so a storage replica can fall behind, fail, or be re-created without affecting whether a transaction is durable.

How long does data stay on the NVMe storage tier before it ages out to journal-only?

AWS has not published a fixed retention window, and the value is internally tuned per partition and per workload. What is documented publicly is the shape of the policy: recent rows and the recent version chain stay materialized on NVMe-backed storage replicas; sufficiently old versions and partitions that no longer fit the working set get evicted, and any future read against them must be reconstructed from the journal. Treat the window as a tuning parameter held by AWS, not a documented SLA.

Can I tune the storage tier split myself as an Aurora DSQL customer?

No. DSQL is serverless. You cannot pin a table to NVMe, configure cache sizes, choose replica counts, or specify a retention window. The split is fully managed by AWS. The right knob, if you need predictability for a query, is workload design: keep hot working sets small, avoid long scans against historical data, and treat any scan that crosses the recent-data horizon as a cold-tier query with the latency penalty that implies.

How does this differ from Aurora classic's six-way quorum to shared storage?

Aurora classic also separates compute from storage, but its storage layer is six copies of the data across three AZs in a single region, written via a 4-of-6 write quorum. There is no S3 cold tier, no journal-then-materialize split, and no multi-region active-active. Aurora classic optimizes for a single-region, low-latency OLTP workload with one writer at a time. DSQL keeps the journal in S3-class durable storage, treats the NVMe replicas as derivable views, and pays for that flexibility with bimodal read latency.

Is Aurora DSQL suitable for analytical or HTAP workloads?

No. DSQL is a transactional database. Long scans against cold partitions pay the journal-reconstruction tax on every read. There is no columnar storage, no vectorized execution, no separate analytics replica. AWS positions analytics as a job for Redshift, Athena over S3, or an OLAP engine you feed from DSQL via change data capture. If your workload is dominated by large scans over old data rather than point reads and short transactions, DSQL is the wrong tool.