MongoDB Internals
MongoDB started as a developer-experience play: a JSON-shaped document store that didn't make you draw an ER diagram before saving your first record. Underneath that ergonomic surface is a fairly conventional B+ tree database (WiredTiger), wrapped in two distinguishing layers: a chunk-based horizontal sharding system that mongos routes queries through, and a Raft-derived replica set protocol that keeps 3-50 nodes consistent over an append-only oplog. This page dissects all three layers (storage, replication, and sharding) with real numbers from production deployments.
WiredTiger Storage Engine
WiredTiger has been MongoDB's default engine since 3.2 (2015), replacing the old MMAPv1 mmap-based engine. It is a standalone B+ tree library that MongoDB embeds; it shares the basic shape of a Postgres B-tree but with several important differences tuned for document workloads.
- Per-document concurrency. WiredTiger uses MVCC with a per-page write lock. Two writers updating different documents on different pages do not block each other. The old MMAPv1 engine held a collection-level write lock - the single biggest reason teams migrated.
- Page size 32 KB. Pages split when they exceed this and are checkpointed every 60 seconds (or 2 GB of journal, whichever comes first). The page cache is sized at 50% of (RAM - 1 GB) by default.
- Block compression. Each block is compressed before write. Snappy is the default (fast, ~2x ratio). Zstd (since 4.2) gives ~3-4x at higher CPU cost. Indexes use prefix compression: a sorted run of keys with shared prefixes only stores the suffix.
- Two trees per collection. One B+ tree holds the documents keyed by a hidden RecordId (an 8-byte int). Each secondary index is its own B+ tree mapping {indexed_field, RecordId} → null. This is why secondary index lookups always require a final fetch back to the data tree - they are not covering by default (the sketch below shows how to make a query covered).
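You can see that extra fetch, and eliminate it, with explain; a minimal mongosh sketch with an illustrative users collection and email field:

// Not covered: IXSCAN on the email index, then FETCH back to the data tree
// (totalDocsExamined > 0 in executionStats).
db.users.createIndex({ email: 1 });
db.users.find({ email: "a@example.com" }).explain("executionStats");

// Covered: project only the indexed field and exclude _id, and the query is
// answered entirely from the index (totalDocsExamined: 0).
db.users.find({ email: "a@example.com" }, { email: 1, _id: 0 }).explain("executionStats");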
# WiredTiger config (mongod.conf)
storage:
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      cacheSizeGB: 16              # explicit cache size
      journalCompressor: snappy    # snappy | zstd | zlib | none
    collectionConfig:
      blockCompressor: zstd        # default for newly created collections
    indexConfig:
      prefixCompression: true      # default true

BSON: The Document Format
MongoDB does not store JSON. It stores BSON - a binary-encoded superset of JSON that adds typed fields (Int32, Int64, Decimal128, Date, ObjectId, Binary, UUID) and length-prefixed strings/documents so the parser can skip fields without scanning byte by byte. BSON is what travels over the wire and what WiredTiger writes to disk.
// JSON the user writes
{ "_id": "alice", "age": 30, "tags": ["admin", "ops"] }
// BSON on the wire (annotated)
\x40\x00\x00\x00 // total document length: 64 bytes
\x02 _id\x00 \x06\x00\x00\x00 alice\x00 // 0x02 = string, length 6, value
\x10 age\x00 \x1e\x00\x00\x00 // 0x10 = int32, value 30
\x04 tags\x00 \x1d\x00\x00\x00 // 0x04 = array (an embedded doc), length 29
\x02 0\x00 \x06\x00\x00\x00 admin\x00
\x02 1\x00 \x04\x00\x00\x00 ops\x00
\x00
\x00 // document terminator
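You can verify the byte math by serializing the same document yourself. A minimal Node sketch, assuming the standalone bson npm package (the driver bundles the same library); the Int32 wrapper keeps age from serializing as a double:

// npm install bson
const { serialize, Int32 } = require("bson");

const bytes = serialize({ _id: "alice", age: new Int32(30), tags: ["admin", "ops"] });

console.log(bytes.length);                        // 64
console.log(Buffer.from(bytes).toString("hex"));  // the annotated bytes above, end to end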
The 16 MB document size limit is not arbitrary - it's the largest size
that fits comfortably in a single network message under the 48 MB max
message size, with headroom for the wrapping OP_MSG envelope.
Documents that need to grow beyond that should split into a parent doc
plus a "bucket" collection (the time-series collections in 5.0+ formalize
this pattern).
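A time-series collection declares the time and metadata fields up front and MongoDB manages the bucket documents itself; a minimal sketch with illustrative names:

// 5.0+: measurements are grouped into internal bucket documents automatically.
db.createCollection("metrics", {
  timeseries: {
    timeField: "ts",         // required: the measurement timestamp
    metaField: "sensor",     // optional: per-series metadata used to group buckets
    granularity: "minutes"   // bucket-span hint: "seconds" | "minutes" | "hours"
  },
  expireAfterSeconds: 60 * 60 * 24 * 30   // optional TTL on raw measurements
});

db.metrics.insertOne({ ts: new Date(), sensor: { id: "s1" }, temp: 21.7 });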
Indexes: B-trees on Document Paths
MongoDB indexes are WiredTiger B+ trees keyed on the value at a JSON path. They support compound keys, multikey (per-array-element) entries, partial filters, TTL expiry, sparse mode, and 2dsphere/text/hashed variants. The query planner picks one index per query stage and intersects indexes only as a last resort - in practice you design the compound index to match the query.
// Compound index on (status, createdAt DESC)
db.orders.createIndex(
{ status: 1, createdAt: -1 },
{ partialFilterExpression: { archived: false } } // partial: only non-archived
);
// Multikey index over an array path
db.posts.createIndex({ "comments.author": 1 });
// Stores one entry per array element. A post with 50 comments
// produces 50 index entries - bloat scales with array length.
// TTL index
db.sessions.createIndex(
{ expiresAt: 1 },
{ expireAfterSeconds: 0 } // background reaper deletes when expiresAt < now()
);
// Hashed index (used for hashed sharding)
db.users.createIndex({ userId: "hashed" });
The ESR rule for compound index field order: Equality
first, then Sort, then Range. A query
find({status: "open"}).sort({created: -1}) against an index
{status:1, created:-1} can serve both predicate and sort
without an in-memory sort. Reverse the order and the planner falls back to a
blocking in-memory SORT stage that materializes the full result set
(and, without allowDiskUse, fails at 100 MB with the dreaded OperationFailed: Sort exceeded memory limit).
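A quick way to confirm ESR ordering is to compare explain output; a mongosh sketch with illustrative fields (status equality, created sort, total range):

// ESR: Equality (status), Sort (created), Range (total)
db.orders.createIndex({ status: 1, created: -1, total: 1 });

// Served by a single IXSCAN, no blocking SORT stage in the plan:
db.orders.find({ status: "open", total: { $gte: 100 } })
         .sort({ created: -1 })
         .explain("executionStats");

// Swap Sort and Range in the index ({ status: 1, total: 1, created: -1 })
// and the same query picks up an in-memory SORT stage.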
Sharding: Chunks, Balancer, mongos
Sharding is how MongoDB scales writes horizontally. It introduces three moving pieces: a shard key (one or more fields, fixed once the collection is sharded unless you reshard on 5.0+), chunks (contiguous ranges of the shard key, default 128 MB), and the balancer (a background process on the config server primary that migrates chunks to even out storage).
A query enters via mongos, which holds a cached chunk map
from the config replica set. If the query includes the shard key, mongos
routes it to exactly the shards owning matching chunks (a "targeted"
query). Without the shard key, mongos broadcasts to every shard and
merges results - a "scatter-gather" that scales linearly with shard count
in the wrong direction.
// Two sharding strategies
sh.shardCollection("app.orders", { customerId: 1 }); // range
sh.shardCollection("app.events", { _id: "hashed" }); // hashed
// Compound shard key β better for query targeting
sh.shardCollection("app.events", { tenantId: 1, ts: 1 });
// Inspect chunk distribution
db.orders.getShardDistribution();
// Shard shard0000 at rs0/...: Data 412.3 MiB docs : 1 200 000 chunks : 4
// Shard shard0001 at rs1/...: Data 419.8 MiB docs : 1 220 000 chunks : 4
// Shard shard0002 at rs2/...: Data 408.1 MiB docs : 1 180 000 chunks : 3

| Shard key choice | Write distribution | Query targeting |
|---|---|---|
| Monotonic (timestamp, ObjectId) | Hot shard (all writes go to chunk holding max) | Range queries excellent |
| Hashed (_id hashed) | Uniform | Range queries scatter-gather |
| Compound (tenantId, ts) | Distributed by tenant | Per-tenant queries targeted |
| Random UUID | Uniform | All queries scatter-gather |
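One way to check targeting is to compare explain output with and without the shard key; a mongosh sketch against the range-sharded app.orders above (stage names can vary by version):

// Shard key present: mongos targets only the shards owning matching chunks
// (typically a SINGLE_SHARD plan).
db.orders.find({ customerId: 42, status: "open" }).explain().queryPlanner;

// Shard key absent: the winning plan merges cursors from every shard
// (typically SHARD_MERGE) - scatter-gather.
db.orders.find({ status: "open" }).explain().queryPlanner;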
Replica Sets & the Oplog
Each shard is itself a replica set: 3-50 mongod processes (max 7 voting) that share a capped collection called the oplog. The primary writes operations to the oplog as part of every write; secondaries tail the oplog asynchronously and apply ops in order. The protocol is a derivative of Raft ("PV1", protocol version 1) with leader leases.
// Sample oplog entry (local.oplog.rs)
{
ts: Timestamp(1714857600, 1), // hybrid logical clock
t: NumberLong(7), // election term
h: NumberLong("-3829110238472..."), // op hash
v: 2,
op: "u", // u=update, i=insert, d=delete
ns: "app.users",
o2: { _id: ObjectId("66...") }, // query selector
o: { $v: 2, diff: { u: { lastSeen: ISODate("...") } } }
}

Election flow. Heartbeats every 2s. After
electionTimeoutMillis (default 10 000) of missed heartbeats,
a secondary increments its term and requests votes. To win, the candidate
needs (a) a majority of voting members, (b) the most recent oplog entry
among voters. End-to-end failover is typically 10-12 seconds. Tightening
electionTimeoutMillis below 5s on cloud VMs almost always
backfires - GC pauses and noisy neighbors trigger spurious elections that
churn through terms.
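The timeout lives in the replica set configuration; a mongosh sketch for inspecting and changing it, plus a quick view of member state:

// Read the current config, adjust the timeout, and push a reconfig.
cfg = rs.conf();
cfg.settings.electionTimeoutMillis = 10000;   // the default; lower with care
rs.reconfig(cfg);

// Member state and last-applied optime at a glance:
rs.status().members.map(m => ({ name: m.name, state: m.stateStr, optime: m.optimeDate }));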
Oplog sizing. The oplog is capped at 5% of free disk by default. Its purpose is to give a temporarily disconnected secondary enough time to catch up before falling off - once an op is gone from the oplog, the secondary requires a full resync. For high write rates, size the oplog to cover at least 24 hours of operations.
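Two commands cover this in practice: rs.printReplicationInfo() reports the current window, and replSetResizeOplog resizes the oplog online. A sketch with illustrative numbers:

// How much history does the oplog currently hold?
rs.printReplicationInfo();
// configured oplog size:   10240 MB
// log length start to end: 104400 secs (29 hrs)   <-- the replication window

// Resize without restarting mongod (run on each member; size is in MB).
db.adminCommand({ replSetResizeOplog: 1, size: 51200 });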
Read & Write Concerns
MongoDB exposes durability as a per-operation knob. Pick the right concern for the operation, not the deployment.
| Write concern | Acks after | Loss window |
|---|---|---|
| w: 1 | Primary local apply | Lost on primary crash |
| w: "majority" | Majority of voting members applied | Cannot be rolled back |
| w: 1, j: true | Primary fsync to journal | Lost if primary's disk dies |
| w: "majority", j: true | Majority + journal fsync | Survives any single failure |
| Read concern | Sees | Use case |
|---|---|---|
| local | Local node state (default) | Most reads |
| available | Local even on stale shard | Sharded reads tolerating staleness |
| majority | Only majority-acked data | Money, audit trails |
| linearizable | No-op write before read | Strict consistency, primary only |
| snapshot | Consistent point-in-time | Inside multi-doc transactions |
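Both knobs are passed per operation; a mongosh sketch with an illustrative payments collection:

// Durable write: wait for a majority of voting members plus the journal.
db.payments.insertOne(
  { _id: "inv-1001", amount: 250 },
  { writeConcern: { w: "majority", j: true } }
);

// Read only majority-committed data - it cannot be rolled back after a failover.
db.payments.find({ _id: "inv-1001" }).readConcern("majority");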
Aggregation Framework
Aggregation is MongoDB's answer to "we still need analytical queries."
A pipeline is an ordered list of stages, each a transformation. The
planner can push $match and $sort down to use
indexes; on a sharded cluster it splits the pipeline so each shard computes a
partial $group and a merge stage ($mergeCursors) combines the results.
db.orders.aggregate([
{ $match: { status: "shipped", createdAt: { $gte: ISODate("2024-01-01") } } },
{ $lookup: { from: "customers", localField: "customerId",
foreignField: "_id", as: "customer" } },
{ $unwind: "$customer" },
{ $group: { _id: "$customer.country",
revenue: { $sum: "$total" },
orderCount: { $sum: 1 } } },
{ $sort: { revenue: -1 } },
{ $limit: 20 }
], { allowDiskUse: true }); // spill to disk if > 100 MB per stage
The $lookup stage is a left-outer join, but it is
not distributed: it runs on the shard executing the pipeline and
issues per-document queries to the foreign collection. For high cardinality
joins this is a footgun. Materialized views via $merge /
$out are the production answer.
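A minimal sketch of that pattern, writing a precomputed rollup into an illustrative daily_revenue collection:

// Run on a schedule; readers query daily_revenue instead of joining live.
db.orders.aggregate([
  { $match: { status: "shipped" } },
  { $group: { _id: { $dateTrunc: { date: "$createdAt", unit: "day" } },
              revenue: { $sum: "$total" } } },
  { $merge: { into: "daily_revenue",
              whenMatched: "replace",      // overwrite the existing rollup doc
              whenNotMatched: "insert" } }
]);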
Multi-Document Transactions
Available since 4.0 (replica sets) and 4.2 (sharded clusters). They use
WiredTiger snapshot isolation underneath; cross-shard transactions add a
two-phase commit coordinator. The default lifetime cap is 60 seconds (transactionLifetimeLimitSeconds) - long-running
txns abort with TransactionExceededLifetimeLimitSeconds.
const session = client.startSession();
try {
session.startTransaction({
readConcern: { level: "snapshot" },
writeConcern: { w: "majority" }
});
await accounts.updateOne({ _id: "a" }, { $inc: { bal: -50 } }, { session });
await accounts.updateOne({ _id: "b" }, { $inc: { bal: 50 } }, { session });
await session.commitTransaction();
} catch (e) {
await session.abortTransaction();
throw e;
} finally {
session.endSession();
}

The MongoDB team's own guidance is to design schemas so transactions are rarely needed. If your aggregate fits in one document, you get atomic writes for free - every single-document update is atomic without any transaction wrapping.
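For example, if an order embeds its line items, both changes below commit together with no session or transaction (field names illustrative):

// One document, one atomic write: the total and the items array change
// together or not at all.
db.orders.updateOne(
  { _id: "order-42" },
  {
    $inc:  { total: 49.99 },
    $push: { items: { sku: "A-17", qty: 1, price: 49.99 } }
  }
);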
Schema Validation
Since 3.6, collections accept a JSON Schema validator. The validator runs
on every insert/update; validationAction decides whether
violations warn (logged, accepted) or error (rejected).
validationLevel chooses whether existing documents (pre-validator)
are checked on update.
db.createCollection("orders", {
validator: {
$jsonSchema: {
bsonType: "object",
required: ["customerId", "total", "status"],
properties: {
customerId: { bsonType: "objectId" },
total: { bsonType: "decimal", minimum: 0 },
status: { enum: ["pending", "shipped", "cancelled"] },
items: { bsonType: "array", minItems: 1 }
}
}
},
validationLevel: "strict", // strict | moderate | off
validationAction: "error" // error | warn
});

Tradeoffs & When To Use
$lookup doesn't distribute. Transactions have a 60-second default ceiling. Replica set elections take 10-12 s - design clients with retries.

FAQ
Is MongoDB schemaless?
Technically yes (collections accept BSON documents of any shape), but production deployments rarely operate that way. MongoDB 3.6+ ships JSON Schema validation that you attach to a collection (validator + validationLevel + validationAction). In practice teams commit to a schema and use $jsonSchema to enforce required fields, types, and bounds. The 'schemaless' marketing was always more of a developer ergonomics claim than an architectural one: WiredTiger still stores those documents in a B-tree keyed by _id, and secondary indexes still need consistent paths to be useful.
How does MongoDB compare to PostgreSQL JSONB for document workloads?
PostgreSQL JSONB stores binary JSON inline in heap rows and supports GIN indexes over arbitrary paths, so it's strong for hybrid relational+document queries on a single node. MongoDB wins on horizontal write scaling (sharding is a first-class operational mode, not an extension) and aggregation pipeline ergonomics; document size limits differ (16 MB per BSON document vs. roughly 1 GB per TOASTed JSONB value in Postgres). PostgreSQL wins on ACID transactions across arbitrary keys, joins, and the entire SQL ecosystem. The honest rule: if you need >1 TB and write-heavy workloads with predictable shard keys, MongoDB. Otherwise PostgreSQL JSONB is usually the safer bet.
What is a replica set election and how long does it take?
A replica set is 3-50 mongod processes that share an oplog. Exactly one is primary, the rest secondaries. Primaries send heartbeats every 2 seconds; if a secondary misses heartbeats for electionTimeoutMillis (default 10s) it triggers an election. The election uses Raft-style voting: each member votes once per term, majority wins, and the candidate must have the most recent oplog entry among voting members. End-to-end failover typically takes 10-12 seconds (10s detection + ~1s election + ~1s client reconnect). You can lower electionTimeoutMillis but heartbeat jitter from GC pauses will start causing spurious elections.
Can MongoDB do multi-document transactions?
Yes since 4.0 (replica sets) and 4.2 (sharded clusters), but with sharp limits. Transactions hold locks on every document they touch and abort if any other write conflicts. The default transactionLifetimeLimitSeconds is 60s β exceed it and the txn aborts. Cross-shard transactions use a 2PC coordinator and have measurable latency overhead (typically 2-5x a single-shard write). MongoDB's own docs recommend using transactions sparingly: for the document model to make sense, your aggregate root should fit in one document, which removes the need for most transactions in the first place.
How does sharding actually distribute data?
You pick a shard key (one or more fields) when you shard a collection. MongoDB hashes or range-partitions the key space into chunks (default 128 MB). Each chunk lives on one shard. The mongos query router consults config servers to know which chunk lives where, and routes reads/writes accordingly. The balancer moves chunks between shards in the background to even out storage. Bad shard key selection is the #1 cause of MongoDB pain: a monotonically increasing key (like ObjectId or timestamp) creates a hot shard because all writes hit the chunk holding the max key. Hashed sharding fixes write distribution but makes range queries scatter-gather across all shards.
What's the difference between read concerns and write concerns?
Write concern (w) controls durability: w:1 acknowledges after the primary writes locally; w:'majority' waits for a majority of voting members to apply the oplog entry; a numeric w greater than 1 waits for that many data-bearing members. Higher w = more durable but slower. Read concern controls visibility: 'local' returns whatever the node has (may be rolled back); 'majority' returns only data acknowledged by a majority (won't be rolled back); 'linearizable' uses a no-op write to confirm the node is still primary before reading; 'snapshot' (in transactions) reads from a consistent point-in-time. Read concern majority requires majority write concern upstream to be useful.
Why does MongoDB use WiredTiger instead of a custom engine?
MongoDB's original MMAPv1 engine relied on memory-mapped files and the OS page cache, which made compaction, compression, and concurrency hard. WiredTiger (acquired in 2014, default since 3.2) gave them: per-document level locking instead of collection-level locks, snappy/zstd/zlib block compression, prefix compression on indexes, and proper checkpointing. WiredTiger uses a B+ tree per collection and per index, with 32 KB pages by default. The cache is separate from the OS page cache, sized to 50% of (RAM - 1 GB). For pure analytical workloads MongoDB's column-store engine (introduced in 7.0) layers on top of WiredTiger.