DynamoDB Internals

DynamoDB is AWS's managed key-value and document store, the production-hardened descendant of the 2007 Dynamo paper. Underneath the simple API is a system that partitions data across thousands of storage nodes, replicates synchronously across three availability zones, scales capacity in seconds, and exposes a deceptively narrow interface — get, put, query, scan. The trick to running it well is understanding what those four operations cost in IOPS, partitions, and dollars, and how the partition key model forces you to design data layouts that look nothing like a normalized relational schema.

Reflects DynamoDB behavior as of 2025: on-demand mode, adaptive capacity, transactional writes, DAX caching, and the standard global tables with multi-region replication.

Why DynamoDB Exists

The Gap

Amazon's 2004 holiday season hit a wall: relational databases could not scale writes for shopping carts and session stores without painful sharding, eventual-consistency hacks, and on-call agony. The 2007 Dynamo paper described the answer; DynamoDB (2012) productized it.

The Insight

Partitioning by a hash of a chosen key, replicating synchronously across three zones, and exposing only the operations the partition layout supports cheaply gives you predictable single-digit-millisecond latency at any scale. Trade joins and ad-hoc queries for guaranteed performance.

The Result

A serverless database that handles 89 million requests per second at the 2024 Prime Day peak, with sub-10 ms p99 reads, no DBA, no version upgrades, no shard reconfiguration. You provision a table, design a key, and ship. The cost is a permanently constrained query model.

Partition Architecture

Key Numbers

Max item size

400 KB

Per-partition write

1000 WCU/s

Per-partition read

3000 RCU/s

Txn write items

100 / 4 MB

Replication AZs

p99 read

~10 ms

DAX cache hit

~1 ms

The Partition Key + Sort Key Model

Every DynamoDB item is identified by a partition key (PK), with an optional sort key (SK). Together they form the primary key. The PK is hashed to determine which partition stores the item; the SK is stored sorted within that partition. This is the entire data model. There are no joins, no foreign keys, no secondary indexes that span partitions transparently.

Table: orders
PK = customer_id (string)
SK = order_id    (string)

customer_id     | order_id           | total | status     | items
----------------+--------------------+-------+------------+-------
"cust-42"       | "2025-05-01#001"   | 99.50 | "shipped"  | [...]
"cust-42"       | "2025-05-01#002"   | 12.00 | "pending"  | [...]
"cust-42"       | "2025-05-02#001"   | 45.00 | "shipped"  | [...]
"cust-9"        | "2025-04-30#001"   | 220.0 | "shipped"  | [...]

The supported access patterns drop directly out of this layout:

GetItem by PK + SK — single-partition O(1) lookup, ~5 ms p99.
Query by PK with optional SK condition — single-partition range scan, cheap and predictable.
Scan across the whole table — every partition, every item. Fine for small tables, never for hot paths.

What you cannot do natively: "find all orders with total > 100" (needs a scan), "join orders to customers" (no joins), "search orders by status" (needs a GSI). Every such pattern requires either an index or a different table layout. Choose the partition key by predicting the single most important access pattern, because that is the one that will be free.

The Hot Partition Problem and Adaptive Capacity

DynamoDB partitions cap at 1000 WCU and 3000 RCU each. If your traffic skews to a single partition key, you hit those limits regardless of your table's provisioned throughput. The classic anti-pattern: PK = "current_day", every write goes to one partition, the table sits at 1% utilization while throttling 99% of writes.

Adaptive capacity (introduced 2017, made instantaneous in 2019) mitigates this. The router monitors per-partition heat and reallocates throughput from cold partitions to hot ones in real time. There is also partition splitting on heat: a partition consistently over its limit will be split, redistributing items by sort key range across two new partitions. Both mechanisms are transparent — no API call, no downtime — but they have limits. A single partition key cannot exceed 1000 WCU/3000 RCU even with adaptive capacity, because all items with the same PK must live on the same partition.

// Anti-pattern: every write hits one partition
PK = "2025-05-03"   // today
SK = randomUUID

// Fix: write sharding
const SHARDS = 16;
PK = "2025-05-03#" + (hash(item_id) % SHARDS);   // 16 partitions instead of 1
SK = randomUUID

// Reads now require a fan-out across N shards, but writes scale linearly

Write sharding is the standard remedy for known-hot keys (e.g. timestamp-based event streams, a single popular product's inventory). The number of shards is a tradeoff: more shards means more parallelism on writes but more parallel queries on reads.

On-Demand vs Provisioned Capacity

DynamoDB exposes two billing models. Provisioned mode requires you to declare WCUs and RCUs in advance; throttling kicks in past your provisioned limit. On-demand mode auto-scales and bills per request. The mental model:

Provisioned — Cheaper at steady utilization above ~30%. Auto-scaling can track traffic, but with minutes of lag and a doubling cap. Best when traffic is predictable.
On-Demand — 6-7x more expensive per request, but pays only for what you use. Handles spikes from zero to peak in seconds (after warming). Best for spiky workloads, dev/test, and unknown traffic profiles.
WCU = 1 KB write; transactional writes count double; large items count up.
RCU = 4 KB strongly consistent read; eventually consistent reads are half; transactional reads count double.

A common gotcha: on-demand has a "previous peak doubled" warm capacity. A new on-demand table can immediately serve 4000 WCU and 12,000 RCU. To safely exceed those, traffic must ramp; otherwise you'll throttle until the table warms.

Global and Local Secondary Indexes

Indexes give you alternative access patterns at the cost of extra storage and write amplification.

LSI (Local Secondary Index) — Same PK as the table, different SK. Strongly consistent reads possible. Must be defined at table creation. Limited to 5 per table. Adds to the 10 GB collection size limit per partition key.
GSI (Global Secondary Index) — Different PK and SK. Eventually consistent only. Can be added after creation. Limited to 20 per table. Has its own provisioned capacity. Writes to the base table propagate asynchronously to GSIs.

// Base table: orders
PK = customer_id, SK = order_id

// GSI on status: "all PENDING orders across all customers"
PK = status,      SK = order_date

// GSI for inverted relationship: "find customer of an order_id"
PK = order_id,    SK = customer_id

GSIs are the real workhorse and the source of most surprise costs. Every write to the base table is double-billed when there is a GSI: one WCU for the base, one WCU for each affected GSI. Sparse indexes (where most items lack the indexed attribute) are a useful trick — only items with a non-null indexed attribute are written to the GSI, dramatically reducing its size and cost.

Single-Table Design

The Rick Houlihan philosophy: pack heterogeneous entity types into one table, using carefully chosen PK/SK conventions so you can read related data with a single Query. This minimizes round-trips, leverages DynamoDB's strengths (single-partition queries are cheap), and avoids the cross-table consistency problems of multi-table designs.

Table: app_data

PK              | SK                       | Type     | Attributes...
----------------+--------------------------+----------+-------------------
"USER#42"       | "PROFILE"                | profile  | email, name, plan
"USER#42"       | "ORDER#2025-05-01#001"   | order    | total, status
"USER#42"       | "ORDER#2025-05-01#002"   | order    | total, status
"USER#42"       | "ADDR#shipping"          | address  | street, city
"PROD#sku-7"    | "DETAIL"                 | product  | name, price
"PROD#sku-7"    | "REVIEW#2025-04-19#u9"   | review   | stars, body

// Query: all data for user 42
Query  PK = "USER#42"

// Query: just orders for user 42 in May
Query  PK = "USER#42" AND begins_with(SK, "ORDER#2025-05-")

Using a generic PK/SK name (rather than entity-specific) lets you define GSIs that span entity types. A GSI on (GSI1PK, GSI1SK) can index orders by status while also indexing reviews by date. The cost of single-table design is steep cognitive overhead — readers must decode the prefix-pattern to understand the schema, and refactoring access patterns later is painful. Use it when access patterns are well-understood; reach for multiple tables when they aren't.

Conditional Writes and Transactions

DynamoDB supports optimistic concurrency through ConditionExpression:

// Increment a counter only if its current value matches expected
UpdateItem  Key = { PK: "USER#42", SK: "COUNTER" }
            UpdateExpression = "SET v = v + :inc"
            ConditionExpression = "v = :expected"
            ExpressionAttributeValues = { :inc: 1, :expected: 41 }

// Returns ConditionalCheckFailedException if v != 41

For multi-item atomicity, TransactWriteItems wraps up to 100 writes (subject to a 4 MB total size) into a serializable transaction across multiple items, multiple tables, even multiple partitions. TransactGetItems does the same for reads. Transactions cost double the WCU/RCU of equivalent non-transactional operations because of the two-phase commit protocol underneath.

TransactWriteItems
  Put     orders     PK="USER#42" SK="ORDER#1"   Condition: "attribute_not_exists(PK)"
  Update  inventory  PK="PROD#7"  SK="STOCK"     SET qty = qty - :n  Condition: "qty >= :n"
  Update  ledger     PK="USER#42" SK="BAL"       SET bal = bal - :total

Transactions are how you build a real e-commerce checkout in DynamoDB. The 100-item cap is the practical limit of how much you can stitch together; beyond it, you fall back to event-driven reconciliation through DynamoDB Streams.

DynamoDB Streams

Streams are an ordered, change-data-capture log of every modification to a table item, retained for 24 hours. Each stream record contains the keys, the previous image, the new image, or both (configurable). They are partition-ordered: changes to the same item arrive in order, but changes to different items can interleave.

// Stream record example (NEW_AND_OLD_IMAGES)
{
  "eventID":   "abc123",
  "eventName": "MODIFY",
  "dynamodb": {
    "Keys":         { "PK": "USER#42", "SK": "ORDER#001" },
    "OldImage":     { "PK": "USER#42", "SK": "ORDER#001", "status": "pending"  },
    "NewImage":     { "PK": "USER#42", "SK": "ORDER#001", "status": "shipped"  },
    "SequenceNumber": "1234567890",
    "ApproximateCreationDateTime": 1714723200
  }
}

The standard pattern: a Lambda function subscribes to the stream and reacts to changes — writing to a search index, denormalizing into another DynamoDB table, publishing to SNS, maintaining materialized views. Streams turn DynamoDB into the event source of an event-driven architecture without bolting on Kafka. Failure handling matters: Lambda retries on failure with exponential backoff, but partition workers will block on a poison record until you wire up a dead-letter queue.

DAX — DynamoDB Accelerator

DAX is an in-memory, write-through caching layer for DynamoDB, API-compatible with the standard SDK. It runs as a multi-node cluster in your VPC and intercepts reads. Cache hits return in under a millisecond. Two caches inside each node:

Item cache — Caches GetItem/BatchGetItem responses by primary key.
Query cache — Caches Query/Scan responses by query parameters.

Both default to a 5-minute TTL, configurable. Writes go through DAX to DynamoDB synchronously, and on success the item is invalidated from the cache. The right use case is read-heavy workloads with cacheable hot keys; the wrong use case is anything that depends on strongly-consistent reads (DAX is eventually consistent against the underlying table).

The 400 KB Limit and What It Forces

No DynamoDB item may exceed 400 KB total — keys, attributes, and overhead. This is not a default that can be raised. Three patterns cope:

Vertical sharding — Split a logical document across multiple items keyed by an "ITEM#1", "ITEM#2" sort-key suffix.
S3 offload — Store the bulky payload in S3 and keep only a pointer (object key) in DynamoDB. S3 has its own 5 TB limit.
Compression — gzip or zstd a JSON blob before writing. Saves both size and WCU (since WCUs are billed per KB of item size).

Combined with the 1 MB Query response cap, the 400 KB limit pushes designs toward many small items rather than few large ones. The pagination flow is built into the API — every Query and Scan response includes a LastEvaluatedKey the client passes back to continue.

Backups, PITR, and Global Tables

Three durability features layer on top of the core engine:

On-demand backups — A consistent point-in-time snapshot stored in DynamoDB's managed backup service. Copies are instant (metadata only) and survive table deletion.
Point-in-time recovery (PITR) — Continuous backups with per-second granularity for the last 35 days. Restore to any second; the source table keeps running.
Global Tables — Multi-region active-active replication. Each region has a full copy; writes propagate asynchronously with last-writer-wins conflict resolution. Replication lag is typically < 1 second; reads are local, writes go to the local region and replicate.

Global tables are how you build a globally low-latency app on DynamoDB. The cost is real (each region's writes count separately) and the consistency model is eventual across regions — conflicts on the same key need application-level reconciliation if last-writer-wins is wrong.

Tradeoffs and When Not to Use It

No ad-hoc queries — If your team needs to answer arbitrary new questions from a table, DynamoDB will fight you. Use Athena, Spark, or export to a relational warehouse.
Cost surprises at scale — Per-request billing is convenient but can produce five-figure monthly bills from a misconfigured loop. Always set CloudWatch alarms on consumed capacity and on the ThrottledRequests metric.
Vendor lock-in — DynamoDB is AWS-only. The data model translates roughly to Cassandra, but the ecosystem (Streams, GSIs, transactions, DAX) does not.
Limited query expressiveness — No aggregations server-side. Counting items in a partition requires scanning them. Maintain counters with conditional updates instead.
The 400 KB ceiling — Anything that looks like a document store with rich nested content needs offload to S3 sooner than you'd expect.

DynamoDB vs Cassandra vs Spanner

	DynamoDB	Cassandra	Spanner
Operator	Managed (AWS only)	Self-hosted or DataStax	Managed (Google Cloud only)
Data model	Key-value + sort key	Wide-column (CQL)	Relational (SQL)
Consistency	Eventual default; strong opt-in	Tunable per query	External / serializable always
Transactions	Multi-item up to 100 / 4 MB	LWT (single partition)	Cross-partition serializable
Replication	3 AZs sync, regions async	Sync within DC, async cross-DC	Paxos sync across regions
Schema changes	None (schemaless)	Online ALTER	Online ALTER, no locks
Best for	Predictable KV at any scale	Time-series, OSS multi-cloud	Strong consistency at scale

Frequently Asked Questions

How do I do a count(*) of items?

You can't, cheaply. Scan with Select=COUNT reads every item to compute it; on a large table this is expensive and slow. The right pattern is to maintain a counter item with atomic UpdateItem ... ADD count :inc on every insert/delete, or to build the counter from DynamoDB Streams.

What's the difference between Query and Scan?

Query operates on a single partition key (and an optional sort key range). It costs RCUs proportional to the data returned and is fast — single-digit milliseconds. Scan reads every partition of the table, paginated by 1 MB chunks. It costs RCUs proportional to the entire table, even if a filter discards most rows. Use Query for everything in production. Scan is for batch jobs, exports, and tables with < 100K items.

Why am I getting throttled at 100 RCU when my table is provisioned for 10,000?

Almost certainly a hot partition. Your traffic is concentrating on a single partition key (or a small subset) and exceeding the 3000 RCU per-partition cap, even though the table-level provisioned capacity is much higher. CloudWatch's "Hot Partition" metric and the contributor insights feature surface the culprit keys. Fix: better key design or write sharding.

How is DynamoDB billed?

Three lines: storage (~$0.25/GB-month), read/write capacity (per RCU/WCU-hour in provisioned, per million requests in on-demand), and data transfer out. Add DynamoDB Streams ($0.02/100K reads), GSI capacity (separate from base), DAX (per-node-hour), backups ($0.10/GB-month), and cross-region replication for global tables.

Can I use SQL with DynamoDB?

PartiQL gives you a SQL-like syntax for SELECT/INSERT/UPDATE/DELETE, but it's a thin facade over the same API — every PartiQL query maps to a Query or Scan, with the same partition-key access requirements. There are no joins, no aggregates, no subqueries. PartiQL is convenient, not a query engine.

How do I migrate from DynamoDB to a relational database?

The right escape hatch is DynamoDB Streams + Lambda + an external sink. Live-tail every change to the new database while a one-time backfill copies the existing table (parallel scan or S3 export). Once caught up, cut writes over to the new system. Reverse migrations (RDBMS → DynamoDB) are harder because relational schemas typically support access patterns that DynamoDB models can't replicate without a redesign.

What's a sparse index and when should I use one?

A sparse GSI is an index where most table items don't have the indexed attribute, so they don't appear in the index. Useful for "find items in state X" — set a status attribute only when an item is in that state, GSI-index that attribute. The index is small and cheap to query. Pattern example: SK = "ORDER#..." and a GSI on "unshipped_marker" that's only present while the order is unshipped.