MongoDB Internals
MongoDB started as a developer-experience play: a JSON-shaped document store that didn't make you draw an ER diagram before saving your first record. Underneath that ergonomic surface is a fairly conventional B+ tree database (WiredTiger), wrapped in two distinguishing layers: a chunk-based horizontal sharding system that mongos routes queries through, and a Raft-derived replica set protocol that keeps 3-50 nodes consistent over an append-only oplog. This page dissects all three layers (storage, replication, and sharding) with real numbers from production deployments.
WiredTiger Storage Engine
WiredTiger has been MongoDB's default engine since 3.2 (2015), replacing the old MMAPv1 mmap-based engine. It is a standalone B+ tree library that MongoDB embeds; it shares the basic shape of a Postgres B-tree but with several important differences tuned for document workloads.
- Per-document concurrency. WiredTiger uses MVCC with a per-page write lock. Two writers updating different documents on different pages do not block each other. The old MMAPv1 engine held a collection-level write lock - the single biggest reason teams migrated.
- Page size 32 KB. Pages split when they exceed this and are checkpointed every 60 seconds (or 2 GB of journal, whichever comes first). The page cache is sized at 50% of (RAM - 1 GB) by default.
- Block compression. Each block is compressed before write. Snappy is the default (fast, ~2x ratio). Zstd (since 4.2) gives ~3-4x at higher CPU cost. Indexes use prefix compression: a sorted run of keys with shared prefixes only stores the suffix.
- Two trees per collection. One B+ tree holds the documents keyed by a hidden RecordId (an 8-byte int). Each secondary index is its own B+ tree mapping {indexed_field, RecordId} → null. This is why secondary index lookups always require a final fetch back to the data tree - they are not covering by default (the sketch below shows how to make a query covered).
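You can see that extra fetch, and eliminate it, with explain; a minimal mongosh sketch with an illustrative users collection and email field:

// Not covered: IXSCAN on the email index, then FETCH back to the data tree
// (totalDocsExamined > 0 in executionStats).
db.users.createIndex({ email: 1 });
db.users.find({ email: "a@example.com" }).explain("executionStats");

// Covered: project only the indexed field and exclude _id, and the query is
// answered entirely from the index (totalDocsExamined: 0).
db.users.find({ email: "a@example.com" }, { email: 1, _id: 0 }).explain("executionStats");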
# WiredTiger config (mongod.conf)
storage:
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      cacheSizeGB: 16              # explicit cache size
      journalCompressor: snappy    # snappy | zstd | zlib | none
    collectionConfig:
      blockCompressor: zstd        # default for newly created collections
    indexConfig:
      prefixCompression: true      # default true

BSON: The Document Format
MongoDB does not store JSON. It stores BSON - a binary-encoded superset of JSON that adds typed fields (Int32, Int64, Decimal128, Date, ObjectId, Binary, UUID) and length-prefixed strings/documents so the parser can skip fields without scanning byte by byte. BSON is what travels over the wire and what WiredTiger writes to disk.
// JSON the user writes
{ "_id": "alice", "age": 30, "tags": ["admin", "ops"] }
// BSON on the wire (annotated)
\x40\x00\x00\x00 // total document length: 64 bytes
\x02 _id\x00 \x06\x00\x00\x00 alice\x00 // 0x02 = string, length 6, value
\x10 age\x00 \x1e\x00\x00\x00 // 0x10 = int32, value 30
\x04 tags\x00 \x1d\x00\x00\x00 // 0x04 = array (an embedded doc), length 29
\x02 0\x00 \x06\x00\x00\x00 admin\x00
\x02 1\x00 \x04\x00\x00\x00 ops\x00
\x00
\x00 // document terminator
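You can verify the byte math by serializing the same document yourself. A minimal Node sketch, assuming the standalone bson npm package (the driver bundles the same library); the Int32 wrapper keeps age from serializing as a double:

// npm install bson
const { serialize, Int32 } = require("bson");

const bytes = serialize({ _id: "alice", age: new Int32(30), tags: ["admin", "ops"] });

console.log(bytes.length);                        // 64
console.log(Buffer.from(bytes).toString("hex"));  // the annotated bytes above, end to end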
The 16 MB document size limit is not arbitrary - it's the largest size
that fits comfortably in a single network message under the 48 MB max
message size, with headroom for the wrapping OP_MSG envelope.
Documents that need to grow beyond that should split into a parent doc
plus a "bucket" collection (the time-series collections in 5.0+ formalize
this pattern).
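A time-series collection declares the time and metadata fields up front and MongoDB manages the bucket documents itself; a minimal sketch with illustrative names:

// 5.0+: measurements are grouped into internal bucket documents automatically.
db.createCollection("metrics", {
  timeseries: {
    timeField: "ts",         // required: the measurement timestamp
    metaField: "sensor",     // optional: per-series metadata used to group buckets
    granularity: "minutes"   // bucket-span hint: "seconds" | "minutes" | "hours"
  },
  expireAfterSeconds: 60 * 60 * 24 * 30   // optional TTL on raw measurements
});

db.metrics.insertOne({ ts: new Date(), sensor: { id: "s1" }, temp: 21.7 });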
Indexes: B-trees on Document Paths
MongoDB indexes are WiredTiger B+ trees keyed on the value at a JSON path. They support compound keys, multikey (per-array-element) entries, partial filters, TTL expiry, sparse mode, and 2dsphere/text/hashed variants. The query planner picks one index per query stage and intersects indexes only as a last resort - in practice you design the compound index to match the query.
// Compound index on (status, createdAt DESC)
db.orders.createIndex(
{ status: 1, createdAt: -1 },
{ partialFilterExpression: { archived: false } } // partial: only non-archived
);
// Multikey index over an array path
db.posts.createIndex({ "comments.author": 1 });
// Stores one entry per array element. A post with 50 comments
// produces 50 index entries - bloat scales with array length.
// TTL index
db.sessions.createIndex(
{ expiresAt: 1 },
{ expireAfterSeconds: 0 } // background reaper deletes when expiresAt < now()
);
// Hashed index (used for hashed sharding)
db.users.createIndex({ userId: "hashed" });
The ESR rule for compound index field order: Equality
first, then Sort, then Range. A query
find({status: "open"}).sort({created: -1}) against an index
{status:1, created:-1} can serve both predicate and sort
without an in-memory sort. Reverse the order and the planner falls back to a
blocking in-memory SORT stage that materializes the full result set
(and, without allowDiskUse, fails at 100 MB with the dreaded OperationFailed: Sort exceeded memory limit).
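A quick way to confirm ESR ordering is to compare explain output; a mongosh sketch with illustrative fields (status equality, created sort, total range):

// ESR: Equality (status), Sort (created), Range (total)
db.orders.createIndex({ status: 1, created: -1, total: 1 });

// Served by a single IXSCAN, no blocking SORT stage in the plan:
db.orders.find({ status: "open", total: { $gte: 100 } })
         .sort({ created: -1 })
         .explain("executionStats");

// Swap Sort and Range in the index ({ status: 1, total: 1, created: -1 })
// and the same query picks up an in-memory SORT stage.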
Sharding: Chunks, Balancer, mongos
Sharding is how MongoDB scales writes horizontally. It introduces three moving pieces: a shard key (one or more fields, fixed once the collection is sharded unless you reshard on 5.0+), chunks (contiguous ranges of the shard key, default 128 MB), and the balancer (a background process on the config server primary that migrates chunks to even out storage).
A query enters via mongos, which holds a cached chunk map
from the config replica set. If the query includes the shard key, mongos
routes it to exactly the shards owning matching chunks (a "targeted"
query). Without the shard key, mongos broadcasts to every shard and
merges results - a "scatter-gather" that scales linearly with shard count
in the wrong direction.
// Two sharding strategies
sh.shardCollection("app.orders", { customerId: 1 }); // range
sh.shardCollection("app.events", { _id: "hashed" }); // hashed
// Compound shard key β better for query targeting
sh.shardCollection("app.events", { tenantId: 1, ts: 1 });
// Inspect chunk distribution
db.orders.getShardDistribution();
// Shard shard0000 at rs0/...: Data 412.3 MiB docs : 1 200 000 chunks : 4
// Shard shard0001 at rs1/...: Data 419.8 MiB docs : 1 220 000 chunks : 4
// Shard shard0002 at rs2/...: Data 408.1 MiB docs : 1 180 000 chunks : 3

| Shard key choice | Write distribution | Query targeting |
|---|---|---|
| Monotonic (timestamp, ObjectId) | Hot shard (all writes go to chunk holding max) | Range queries excellent |
| Hashed (_id hashed) | Uniform | Range queries scatter-gather |
| Compound (tenantId, ts) | Distributed by tenant | Per-tenant queries targeted |
| Random UUID | Uniform | All queries scatter-gather |
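One way to check targeting is to compare explain output with and without the shard key; a mongosh sketch against the range-sharded app.orders above (stage names can vary by version):

// Shard key present: mongos targets only the shards owning matching chunks
// (typically a SINGLE_SHARD plan).
db.orders.find({ customerId: 42, status: "open" }).explain().queryPlanner;

// Shard key absent: the winning plan merges cursors from every shard
// (typically SHARD_MERGE) - scatter-gather.
db.orders.find({ status: "open" }).explain().queryPlanner;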
Replica Sets & the Oplog
Each shard is itself a replica set: 3-50 mongod processes (max 7 voting) that share a capped collection called the oplog. The primary writes operations to the oplog as part of every write; secondaries tail the oplog asynchronously and apply ops in order. The protocol is a derivative of Raft ("PV1", protocol version 1) with leader leases.
// Sample oplog entry (local.oplog.rs)
{
ts: Timestamp(1714857600, 1), // hybrid logical clock
t: NumberLong(7), // election term
h: NumberLong("-3829110238472..."), // op hash
v: 2,
op: "u", // u=update, i=insert, d=delete
ns: "app.users",
o2: { _id: ObjectId("66...") }, // query selector
o: { $v: 2, diff: { u: { lastSeen: ISODate("...") } } }
}

Election flow. Heartbeats every 2s. After
electionTimeoutMillis (default 10 000) of missed heartbeats,
a secondary increments its term and requests votes. To win, the candidate
needs (a) a majority of voting members, (b) the most recent oplog entry
among voters. End-to-end failover is typically 10-12 seconds. Tightening
electionTimeoutMillis below 5s on cloud VMs almost always
backfires - GC pauses and noisy neighbors trigger spurious elections that
churn through terms.
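The timeout lives in the replica set configuration; a mongosh sketch for inspecting and changing it, plus a quick view of member state:

// Read the current config, adjust the timeout, and push a reconfig.
cfg = rs.conf();
cfg.settings.electionTimeoutMillis = 10000;   // the default; lower with care
rs.reconfig(cfg);

// Member state and last-applied optime at a glance:
rs.status().members.map(m => ({ name: m.name, state: m.stateStr, optime: m.optimeDate }));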
Oplog sizing. The oplog is capped at 5% of free disk by default. Its purpose is to give a temporarily disconnected secondary enough time to catch up before falling off - once an op is gone from the oplog, the secondary requires a full resync. For high write rates, size the oplog to cover at least 24 hours of operations.
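Two commands cover this in practice: rs.printReplicationInfo() reports the current window, and replSetResizeOplog resizes the oplog online. A sketch with illustrative numbers:

// How much history does the oplog currently hold?
rs.printReplicationInfo();
// configured oplog size:   10240 MB
// log length start to end: 104400 secs (29 hrs)   <-- the replication window

// Resize without restarting mongod (run on each member; size is in MB).
db.adminCommand({ replSetResizeOplog: 1, size: 51200 });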
Read & Write Concerns
MongoDB exposes durability as a per-operation knob. Pick the right concern for the operation, not the deployment.
| Write concern | Acks after | Loss window |
|---|---|---|
| w: 1 | Primary local apply | Lost on primary crash |
| w: "majority" | Majority of voting members applied | Cannot be rolled back |
| w: 1, j: true | Primary fsync to journal | Lost if primary's disk dies |
| w: "majority", j: true | Majority + journal fsync | Survives any single failure |
| Read concern | Sees | Use case |
|---|---|---|
| local | Local node state (default) | Most reads |
| available | Local even on stale shard | Sharded reads tolerating staleness |
| majority | Only majority-acked data | Money, audit trails |
| linearizable | No-op write before read | Strict consistency, primary only |
| snapshot | Consistent point-in-time | Inside multi-doc transactions |
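Both knobs are passed per operation; a mongosh sketch with an illustrative payments collection:

// Durable write: wait for a majority of voting members plus the journal.
db.payments.insertOne(
  { _id: "inv-1001", amount: 250 },
  { writeConcern: { w: "majority", j: true } }
);

// Read only majority-committed data - it cannot be rolled back after a failover.
db.payments.find({ _id: "inv-1001" }).readConcern("majority");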
Aggregation Framework
Aggregation is MongoDB's answer to "we still need analytical queries."
A pipeline is an ordered list of stages, each a transformation. The
planner can push $match and $sort down to use
indexes; on a sharded cluster it splits the pipeline so each shard computes a
partial $group and a merge stage ($mergeCursors) combines the results.
db.orders.aggregate([
{ $match: { status: "shipped", createdAt: { $gte: ISODate("2024-01-01") } } },
{ $lookup: { from: "customers", localField: "customerId",
foreignField: "_id", as: "customer" } },
{ $unwind: "$customer" },
{ $group: { _id: "$customer.country",
revenue: { $sum: "$total" },
orderCount: { $sum: 1 } } },
{ $sort: { revenue: -1 } },
{ $limit: 20 }
], { allowDiskUse: true }); // spill to disk if > 100 MB per stage
The $lookup stage is a left-outer join, but it is
not distributed: it runs on the shard executing the pipeline and
issues per-document queries to the foreign collection. For high cardinality
joins this is a footgun. Materialized views via $merge /
$out are the production answer.
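A minimal sketch of that pattern, writing a precomputed rollup into an illustrative daily_revenue collection:

// Run on a schedule; readers query daily_revenue instead of joining live.
db.orders.aggregate([
  { $match: { status: "shipped" } },
  { $group: { _id: { $dateTrunc: { date: "$createdAt", unit: "day" } },
              revenue: { $sum: "$total" } } },
  { $merge: { into: "daily_revenue",
              whenMatched: "replace",      // overwrite the existing rollup doc
              whenNotMatched: "insert" } }
]);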
Multi-Document Transactions
Available since 4.0 (replica sets) and 4.2 (sharded clusters). They use
WiredTiger snapshot isolation underneath; cross-shard transactions add a
two-phase commit coordinator. The default lifetime cap is 60 seconds (transactionLifetimeLimitSeconds) - long-running
txns abort with TransactionExceededLifetimeLimitSeconds.
const session = client.startSession();
try {
session.startTransaction({
readConcern: { level: "snapshot" },
writeConcern: { w: "majority" }
});
await accounts.updateOne({ _id: "a" }, { $inc: { bal: -50 } }, { session });
await accounts.updateOne({ _id: "b" }, { $inc: { bal: 50 } }, { session });
await session.commitTransaction();
} catch (e) {
await session.abortTransaction();
throw e;
} finally {
session.endSession();
}

The MongoDB team's own guidance is to design schemas so transactions are rarely needed. If your aggregate fits in one document, you get atomic writes for free - every single-document update is atomic without any transaction wrapping.
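For example, if an order embeds its line items, both changes below commit together with no session or transaction (field names illustrative):

// One document, one atomic write: the total and the items array change
// together or not at all.
db.orders.updateOne(
  { _id: "order-42" },
  {
    $inc:  { total: 49.99 },
    $push: { items: { sku: "A-17", qty: 1, price: 49.99 } }
  }
);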
Schema Validation
Since 3.6, collections accept a JSON Schema validator. The validator runs
on every insert/update; validationAction decides whether
violations warn (logged, accepted) or error (rejected).
validationLevel chooses whether existing documents (pre-validator)
are checked on update.
db.createCollection("orders", {
validator: {
$jsonSchema: {
bsonType: "object",
required: ["customerId", "total", "status"],
properties: {
customerId: { bsonType: "objectId" },
total: { bsonType: "decimal", minimum: 0 },
status: { enum: ["pending", "shipped", "cancelled"] },
items: { bsonType: "array", minItems: 1 }
}
}
},
validationLevel: "strict", // strict | moderate | off
validationAction: "error" // error | warn
});

Tradeoffs & When To Use
$lookup doesn't distribute. Transactions have a 60-second default ceiling. Replica set elections take 10-12 s - design clients with retries.

FAQ
Is MongoDB schemaless?
Technically yes (collections accept BSON documents of any shape), but production deployments rarely operate that way. MongoDB 3.6+ ships JSON Schema validation that you attach to a collection (validator + validationLevel + validationAction). In practice teams commit to a schema and use $jsonSchema to enforce required fields, types, and bounds. The 'schemaless' marketing was always more of a developer ergonomics claim than an architectural one: WiredTiger still stores those documents in a B-tree keyed by _id, and secondary indexes still need consistent paths to be useful.
How does MongoDB compare to PostgreSQL JSONB for document workloads?
PostgreSQL JSONB stores binary JSON inline in heap rows and supports GIN indexes over arbitrary paths, so it's strong for hybrid relational+document queries on a single node. MongoDB wins on horizontal write scaling (sharding is a first-class operational mode, not an extension) and aggregation pipeline ergonomics; document size limits differ (16 MB per BSON document vs. roughly 1 GB per TOASTed JSONB value in Postgres). PostgreSQL wins on ACID transactions across arbitrary keys, joins, and the entire SQL ecosystem. The honest rule: if you need >1 TB and write-heavy workloads with predictable shard keys, MongoDB. Otherwise PostgreSQL JSONB is usually the safer bet.
What is a replica set election and how long does it take?
A replica set is 3-50 mongod processes that share an oplog. Exactly one is primary, the rest secondaries. Primaries send heartbeats every 2 seconds; if a secondary misses heartbeats for electionTimeoutMillis (default 10s) it triggers an election. The election uses Raft-style voting: each member votes once per term, majority wins, and the candidate must have the most recent oplog entry among voting members. End-to-end failover typically takes 10-12 seconds (10s detection + ~1s election + ~1s client reconnect). You can lower electionTimeoutMillis but heartbeat jitter from GC pauses will start causing spurious elections.
Can MongoDB do multi-document transactions?
Yes since 4.0 (replica sets) and 4.2 (sharded clusters), but with sharp limits. Transactions hold locks on every document they touch and abort if any other write conflicts. The default transactionLifetimeLimitSeconds is 60s β exceed it and the txn aborts. Cross-shard transactions use a 2PC coordinator and have measurable latency overhead (typically 2-5x a single-shard write). MongoDB's own docs recommend using transactions sparingly: for the document model to make sense, your aggregate root should fit in one document, which removes the need for most transactions in the first place.
How does sharding actually distribute data?
You pick a shard key (one or more fields) when you shard a collection. MongoDB hashes or range-partitions the key space into chunks (default 128 MB). Each chunk lives on one shard. The mongos query router consults config servers to know which chunk lives where, and routes reads/writes accordingly. The balancer moves chunks between shards in the background to even out storage. Bad shard key selection is the #1 cause of MongoDB pain: a monotonically increasing key (like ObjectId or timestamp) creates a hot shard because all writes hit the chunk holding the max key. Hashed sharding fixes write distribution but makes range queries scatter-gather across all shards.
What's the difference between read concerns and write concerns?
Write concern (w) controls durability: w:1 acknowledges after the primary writes locally; w:'majority' waits for a majority of voting members to apply the oplog entry; a numeric w greater than 1 waits for that many data-bearing members. Higher w = more durable but slower. Read concern controls visibility: 'local' returns whatever the node has (may be rolled back); 'majority' returns only data acknowledged by a majority (won't be rolled back); 'linearizable' uses a no-op write to confirm the node is still primary before reading; 'snapshot' (in transactions) reads from a consistent point-in-time. Read concern majority requires majority write concern upstream to be useful.
Why does MongoDB use WiredTiger instead of a custom engine?
MongoDB's original MMAPv1 engine relied on memory-mapped files and the OS page cache, which made compaction, compression, and concurrency hard. WiredTiger (acquired in 2014, default since 3.2) gave them: per-document level locking instead of collection-level locks, snappy/zstd/zlib block compression, prefix compression on indexes, and proper checkpointing. WiredTiger uses a B+ tree per collection and per index, with 32 KB pages by default. The cache is separate from the OS page cache, sized to 50% of (RAM - 1 GB). For pure analytical workloads MongoDB's column-store engine (introduced in 7.0) layers on top of WiredTiger.