Cassandra Internals

Cassandra was the answer to a specific question: what if you took Amazon's Dynamo paper (consistent hashing, replication, eventual consistency) and Google's Bigtable paper (wide-column model, LSM-tree storage), and merged them into one system? The result is a peer-to-peer, masterless, AP-by-default database where every node is identical, scaling is linear, and the data model rewards designing your tables around your queries rather than your entities. This page dissects the moving parts that make that work: the token ring with vnodes, the write path through commit log and memtables to SSTables, compaction strategies, gossip, hinted handoff, read repair, and Merkle-tree-based anti-entropy.

Architecture Overview — The Token Ring

Key Numbers

Default vnodes (4.x)

Token Space

2^64

Memtable Flush

~1 GB

Default RF

Hint Window

3 hours

GC Grace

10 days

Bloom Filter FP

0.01

Why Cassandra Exists

The Gap

By 2008 Facebook needed inbox search at a scale where MySQL primary-replica topologies fell over: every write is global, replicas can't take writes, and a primary failure means scrambled human intervention. Operating multi-datacenter MySQL meant accepting hours of staleness or building a custom replication mesh. There was no off-the-shelf system that was masterless, multi-DC, and linearly scalable.

The Insight

If you accept eventual consistency as the default and tunable consistency per operation, you can have every node be identical (no master election), partition by consistent hash of the key (no central directory), and replicate by walking the ring (no async replication lag from a single primary). Multi-datacenter falls out for free — replicas can be chosen by datacenter using NetworkTopologyStrategy.

The Result

A database that scales linearly to thousands of nodes across continents, where adding a node is a one-command operation, and where there is genuinely no single point of failure. The cost — no joins, no real transactions, careful schema design around access patterns — is exactly what you'd expect when you trade ACID for AP at internet scale.

The Wide-Column Data Model

Cassandra's data model has three keys: a partition key decides which node owns the data; a clustering key orders rows within a partition; the rest are regular columns. CQL makes this look like SQL but the physical layout is what matters.

CREATE TABLE messages (
  user_id    UUID,
  msg_time   TIMEUUID,
  sender     TEXT,
  body       TEXT,
  attachment BLOB,
  PRIMARY KEY ((user_id), msg_time)        -- (PK), CK
) WITH CLUSTERING ORDER BY (msg_time DESC)
  AND compaction = { 'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_size': 1,
                     'compaction_window_unit': 'DAYS' }
  AND default_time_to_live = 2592000;       -- 30 days

-- Logical row layout for one user's partition:
-- user_id=alice  | msg_time=t1 | sender=bob | body=...| attachment=...
--                | msg_time=t2 | sender=eve | body=...| attachment=null
--                | msg_time=t3 | ...

-- Physical layout on disk (one SSTable, partition contiguous):
-- alice : t1:sender=bob, t1:body=..., t2:sender=eve, t2:body=..., ...

Three rules of thumb that fall out of this layout:

Partition by access pattern. All rows for the same partition key live together on one node — perfect for "give me the last 50 messages for user X" — but a partition that grows unbounded (say, all events ever) becomes a hotspot and a memory bomb at compaction time. Aim for partitions under ~100 MB.
Clustering keys define on-disk sort order, so range queries within a partition are pure sequential reads.
WHERE without partition key requires ALLOW FILTERING, which scans every node. The system warns you because it almost always means the schema is wrong for the query.

The Token Ring & Vnodes

Cassandra hashes every partition key (default Murmur3Partitioner) onto a 64-bit token. The unsigned token space is conceptually a ring; each node owns one or more contiguous ranges of it. To find a key's replicas, walk the ring clockwise starting at the key's token and pick the first RF distinct nodes (constrained by datacenter and rack with NetworkTopologyStrategy).

Without vnodes, each node owns exactly one range. Bootstrapping a new node means streaming data from its two neighbors only — slow, and hot-spotted on those two source nodes. With vnodes (default 16 in 4.x, was 256 in 3.x), each physical node owns many small ranges scattered around the ring. Bootstrapping streams from many sources in parallel, and load is distributed even with heterogeneous node sizes (give a bigger node more vnodes).

# cassandra.yaml
num_tokens: 16                               # vnodes per physical node
allocate_tokens_for_local_replication_factor: 3
endpoint_snitch: GossipingPropertyFileSnitch  # rack/DC awareness
partitioner: org.apache.cassandra.dht.Murmur3Partitioner

The 256-vnode default in 3.x was a known mistake — too many vnodes meant every range query touched too many nodes, multiplying coordinator latency. 4.x dropped to 16 with smarter token allocation that keeps load even.

The Write Path

A write hits the coordinator (any node — clients pick one) which forwards it to all RF replicas. Each replica's local path is the classic LSM:

CLIENT
  └─ coordinator (any node)
       └─ partitioner.getToken(partition_key)
       └─ ring.getReplicas(token, RF)
       └─ for each replica:
            ├─ append to commit log (sequential, fsync per group)
            ├─ insert into memtable (in-memory sorted map per table)
            └─ ack to coordinator
       └─ wait for CL acks (e.g., LOCAL_QUORUM = 2 of 3)
       └─ return to client

ASYNC: when memtable hits ~1 GB or 60s timer
       ├─ flush memtable → new SSTable on disk (immutable)
       ├─ build bloom filter, partition index, summary
       └─ truncate commit log segments fully covered by SSTables

Hinted handoff. If a replica is down, the coordinator stores a hint (the write, addressed to the down replica) in a local hints file. When the replica returns within max_hint_window_in_ms (default 3 hours), the coordinator replays the hints. After 3 hours, hints are dropped — the replica catches up via read repair or anti-entropy.

SSTables & Compaction Strategies

SSTables are immutable sorted runs of (partition_key, clustering_key, column) → value entries. A single Cassandra table on a single node typically has dozens to hundreds of SSTables. Reads must consult all of them (filtered by bloom filter and min/max key range) and merge results, with later versions winning. Compaction is the background process that merges SSTables to keep this manageable.

Strategy	How	Best for	Pitfall
STCS (Size-Tiered)	Group ~4 similarly-sized SSTables, merge into one larger one	Write-heavy, append-only	Read amplification — many SSTables overlap key ranges
LCS (Leveled)	Levels L0..Ln, each ~10x the previous, non-overlapping within a level	Read-heavy, update-heavy	~10x more compaction I/O than STCS
TWCS (Time-Window)	STCS within a time window; old windows dropped wholesale on TTL	Time-series, immutable + TTL	Breaks if you UPDATE old rows — defeats window grouping
UCS (Unified, 5.0+)	Hybrid that interpolates between STCS and LCS by load	General-purpose default going forward	Newer, less battle-tested

Choosing wrong is one of Cassandra's classic foot-cannons. A messaging app that picked STCS will get progressively slower reads as old messages pile up. The same app on TWCS drops yesterday's window in O(1) once every row has expired.

The Read Path

Reads at QUORUM contact ceil(RF/2)+1 replicas and merge results. Each replica's local read is multi-stage:

READ for (partition_key, clustering_key)
  1. Check row cache (off-heap, default off — usually skipped)
  2. Check key cache: maps partition_key → SSTable offset
  3. For each candidate SSTable (filtered by min/max partition token):
       a. Check bloom filter (FP rate ~1%, kept in memory)
       b. If bloom positive, binary search partition summary
       c. Read partition index entry from disk
       d. Seek to offset, read partition data, filter by clustering_key
  4. Merge results from all SSTables + memtable, latest version wins
  5. Apply tombstones

Read repair. When the coordinator gets responses from multiple replicas at QUORUM, it compares them. If they disagree (latest timestamp wins), it pushes the merged result to the stale replicas synchronously before returning. There's also read_repair_chance for digest-only reads — the coordinator occasionally asks a replica to send a checksum instead of the value, and triggers a full read+merge if it disagrees.

Tunable Consistency: R + W > N

Every read and write picks a consistency level. The system gives you strong consistency for that operation if and only if R + W > N, where N is the replication factor.

CL	Acks needed	Latency	Failure tolerance
ANY	1 (incl. hint)	Lowest	None — hint can be dropped
ONE	1 replica	Low	Stale reads possible
QUORUM	ceil(N/2)+1	Median	Tolerates ⌊N/2⌋ failures
LOCAL_QUORUM	QUORUM in local DC	Low (no WAN)	Recommended for multi-DC
EACH_QUORUM	QUORUM in every DC	WAN-bound	Strong cross-DC, expensive
ALL	All N replicas	Highest	Fails if any replica is down

The most common production choice is RF=3, R=W=LOCAL_QUORUM: strong consistency within a datacenter, no WAN dependency for local operations, tolerates 1 node failure transparently. Cross-DC consistency is achieved by writes propagating asynchronously between DCs (one coordinator forwards to a coordinator in each DC).

Gossip & the Phi Failure Detector

Cassandra has no leader, no master, no central registry. Cluster membership is maintained by gossip: every second, each node picks 1-3 random peers and exchanges a state digest containing version-clocked tuples (heartbeat, status, schema version, load). After log(N) gossip rounds, news has reached the entire cluster.

Failure detection uses the Phi Accrual detector. Rather than a fixed timeout ("dead at 5s"), each node tracks the inter-arrival time history of gossip messages from each peer. The detector emits a continuously-valued phi (suspicion level) — when phi exceeds phi_convict_threshold (default 8), the node is marked DOWN. This adapts to network conditions: a node consistently slow is given more leeway than a node that goes silent abruptly.

Anti-Entropy Repair

Read repair and hinted handoff handle the common case, but eventually replicas can drift. Periodic repair reconciles them by computing Merkle trees per token range and exchanging hashes.

# Run on every node, every (gc_grace_seconds - safety) — e.g., weekly for default 10d grace
nodetool repair -pr keyspace1                    # primary range only

# Incremental repair (4.0+) — only un-repaired SSTables
nodetool repair -inc -pr keyspace1

# How it works internally:
# 1. Each replica builds a Merkle tree over the keyspace ranges
# 2. Replicas exchange Merkle tree hashes
# 3. For ranges where hashes differ, replicas stream the data
# 4. Newest timestamps win on merge

Without periodic repair, tombstones older than gc_grace_seconds can be purged on some replicas before they've reached others, causing "deleted" data to resurrect (the infamous "zombie data" problem). For this reason, Cassandra operations include scheduled repair as a hard requirement, not a luxury.

Tradeoffs & When To Use

Use Cassandra when

You have heavy writes, predictable access patterns by partition key, multi-datacenter requirements, and a need for linear scale to many TB. Time-series, IoT, messaging, audit logs, leaderboards, sensor data — workloads where partition-and-conquer is natural.

Avoid Cassandra when

Your queries change frequently (you can't redesign your tables every sprint). You need real ACID transactions across rows. Your access patterns are ad-hoc analytics. Your team is small — operating a Cassandra cluster well requires real expertise. PostgreSQL with read replicas often beats Cassandra under 1 TB.

Operational gotchas

Tombstones — design for them. Wide partitions (>100 MB) — split with bucketing. Wrong compaction strategy — pick by workload, not by cargo cult. Repair scheduling — non-negotiable. JVM GC pauses — hence the rise of Scylla, which uses a thread-per-core C++ engine.

Cassandra vs ScyllaDB

	Cassandra	ScyllaDB
Language / runtime	Java (JVM)	C++ (Seastar, no GC)
Threading	Thread pool, shared state	Shard-per-core, share-nothing
Tail latency	GC-pause spikes	Predictable µs-scale
Throughput per node	Baseline	3-10x typical
Wire protocol	CQL native	CQL native (drop-in)
Compaction	STCS / LCS / TWCS / UCS	STCS / LCS / TWCS / Incremental
Lightweight transactions	Paxos-based LWT	Same protocol, faster
License / model	Apache 2.0, community	Open + Enterprise paywall

FAQ

Is Cassandra a key-value store or a column store?

Both, depending on the lens. Logically Cassandra is a wide-column store: rows are identified by a partition key plus an optional clustering key, and within a partition you can have many columns (think wide rows like a sparse spreadsheet). Physically the storage engine is more like a key-value LSM-tree where the key is (partition_key, clustering_key, column_name). Modern CQL hides most of this — you write tables and queries that look broadly relational. The wide-column heritage shows up in things like 'don't use ALLOW FILTERING' and 'put your access pattern in the partition key, not in WHERE clauses.'

How does tunable consistency actually work?

Every read and write specifies a consistency level: ONE, LOCAL_ONE, QUORUM, LOCAL_QUORUM, EACH_QUORUM, ALL, ANY. Writes wait for that many replicas to acknowledge before returning success; reads wait for that many to respond before returning the data. The math: if you set R + W > N (where N is the replication factor), you get strong consistency for that operation. So with N=3, R=W=QUORUM=2 satisfies R+W=4>3. R=ONE+W=ONE only satisfies R+W=2<3, so you can read a stale value. The 'tunable' part is that you make this choice per-statement — heavy writes can use QUORUM while less critical reads use ONE.

What is a vnode and why do they exist?

Cassandra hashes every partition key onto a 64-bit token ring. Originally each node owned exactly one contiguous range of the ring; this made bootstrap and decommission slow because all data movement was between two nodes. Vnodes (since 1.2) split each node's ownership into many small ranges (default 16 in 4.0+, was 256 in 3.x). When you add a node, it claims tokens from many existing nodes simultaneously, distributing the streaming load. A 16-node cluster with 16 vnodes each looks like 256 logical owners on the ring. The downside: each query that doesn't specify the partition key may need to talk to many vnodes, multiplying coordinator overhead.

Why doesn't Cassandra do joins?

Cassandra optimizes for predictable horizontal scale and low-latency reads. A join would mean a coordinator scattering reads across (potentially) every node in the cluster, joining in memory, and returning. That's a query model that breaks the latency contract. The Cassandra answer is: denormalize. Materialize each query's result into its own table at write time. Modern CQL supports materialized views and SAI indexes that automate some of this, but the rule of thumb stays: model queries first, schema second.

How does compaction work and why does it matter?

Writes append to the commit log + memtable, then the memtable flushes to an immutable SSTable. Over time you accumulate dozens of SSTables containing different versions of the same row. Compaction merges them, dropping tombstoned and overwritten data. There are three strategies: Size-Tiered (STCS) groups similarly-sized SSTables and merges them — best for write-heavy workloads but causes read amplification. Leveled (LCS) maintains levels where each SSTable in level L+1 is 10x larger than level L, with non-overlapping keys per level — best for read-heavy and update-heavy workloads but uses more I/O for compaction. Time-Window (TWCS) groups SSTables by write time — used for time-series with TTL, where old SSTables can drop wholesale.

When do tombstones become a problem?

Every delete or expiring TTL writes a tombstone — a marker saying 'this column is deleted.' Tombstones must outlive gc_grace_seconds (default 10 days, equal to the hinted handoff window) so they're not 'forgotten' before all replicas process them. If you read a partition with millions of tombstones, the read path has to scan all of them, sort with the live data, and apply deletions. This caused famous outages where a queue-style workload (insert + delete + read range) ground to a halt. The mitigation: avoid using Cassandra as a queue, use TWCS for TTL data, and watch the tombstone_warn_threshold + tombstone_failure_threshold settings.

How does Cassandra compare to ScyllaDB?

Scylla is a from-scratch C++ rewrite of Cassandra that keeps the protocol and data model but throws out the JVM and the threading model. Where Cassandra runs a thread pool with shared state and locks, Scylla runs one thread pinned per CPU core, with a shared-nothing architecture (Seastar framework) and asynchronous I/O. The result is 3-10x throughput per node and predictable tail latency without GC pauses. Operationally Scylla is a drop-in replacement for most Cassandra workloads. The cost: smaller community, smaller ecosystem of tools, and Scylla's enterprise features sit behind a paywall.