Cassandra Internals
Cassandra was the answer to a specific question: what if you took Amazon's Dynamo paper (consistent hashing, replication, eventual consistency) and Google's Bigtable paper (wide-column model, LSM-tree storage), and merged them into one system? The result is a peer-to-peer, masterless, AP-by-default database where every node is identical, scaling is linear, and the data model rewards designing your tables around your queries rather than your entities. This page dissects the moving parts that make that work: the token ring with vnodes, the write path through commit log and memtables to SSTables, compaction strategies, gossip, hinted handoff, read repair, and Merkle-tree-based anti-entropy.
Architecture Overview β The Token Ring
Key Numbers
Why Cassandra Exists
The Wide-Column Data Model
Cassandra's data model has three keys: a partition key decides which node owns the data; a clustering key orders rows within a partition; the rest are regular columns. CQL makes this look like SQL but the physical layout is what matters.
CREATE TABLE messages (
user_id UUID,
msg_time TIMEUUID,
sender TEXT,
body TEXT,
attachment BLOB,
PRIMARY KEY ((user_id), msg_time) -- (PK), CK
) WITH CLUSTERING ORDER BY (msg_time DESC)
AND compaction = { 'class': 'TimeWindowCompactionStrategy',
'compaction_window_size': 1,
'compaction_window_unit': 'DAYS' }
AND default_time_to_live = 2592000; -- 30 days
-- Logical row layout for one user's partition:
-- user_id=alice | msg_time=t1 | sender=bob | body=...| attachment=...
-- | msg_time=t2 | sender=eve | body=...| attachment=null
-- | msg_time=t3 | ...
-- Physical layout on disk (one SSTable, partition contiguous):
-- alice : t1:sender=bob, t1:body=..., t2:sender=eve, t2:body=..., ... Three rules of thumb that fall out of this layout:
- Partition by access pattern. All rows for the same partition key live together on one node β perfect for "give me the last 50 messages for user X" β but a partition that grows unbounded (say, all events ever) becomes a hotspot and a memory bomb at compaction time. Aim for partitions under ~100 MB.
- Clustering keys define on-disk sort order, so range queries within a partition are pure sequential reads.
- WHERE without partition key requires
ALLOW FILTERING, which scans every node. The system warns you because it almost always means the schema is wrong for the query.
The Token Ring & Vnodes
Cassandra hashes every partition key (default Murmur3Partitioner)
onto a 64-bit token. The unsigned token space is conceptually a ring;
each node owns one or more contiguous ranges of it. To find a key's
replicas, walk the ring clockwise starting at the key's token and pick
the first RF distinct nodes (constrained by datacenter and
rack with NetworkTopologyStrategy).
Without vnodes, each node owns exactly one range. Bootstrapping a new node means streaming data from its two neighbors only β slow, and hot-spotted on those two source nodes. With vnodes (default 16 in 4.x, was 256 in 3.x), each physical node owns many small ranges scattered around the ring. Bootstrapping streams from many sources in parallel, and load is distributed even with heterogeneous node sizes (give a bigger node more vnodes).
# cassandra.yaml
num_tokens: 16 # vnodes per physical node
allocate_tokens_for_local_replication_factor: 3
endpoint_snitch: GossipingPropertyFileSnitch # rack/DC awareness
partitioner: org.apache.cassandra.dht.Murmur3Partitioner The 256-vnode default in 3.x was a known mistake β too many vnodes meant every range query touched too many nodes, multiplying coordinator latency. 4.x dropped to 16 with smarter token allocation that keeps load even.
The Write Path
A write hits the coordinator (any node β clients pick one) which forwards
it to all RF replicas. Each replica's local path is the
classic LSM:
CLIENT
ββ coordinator (any node)
ββ partitioner.getToken(partition_key)
ββ ring.getReplicas(token, RF)
ββ for each replica:
ββ append to commit log (sequential, fsync per group)
ββ insert into memtable (in-memory sorted map per table)
ββ ack to coordinator
ββ wait for CL acks (e.g., LOCAL_QUORUM = 2 of 3)
ββ return to client
ASYNC: when memtable hits ~1 GB or 60s timer
ββ flush memtable β new SSTable on disk (immutable)
ββ build bloom filter, partition index, summary
ββ truncate commit log segments fully covered by SSTables Hinted handoff. If a replica is down, the coordinator
stores a hint (the write, addressed to the down replica) in a
local hints file. When the replica returns within
max_hint_window_in_ms (default 3 hours), the coordinator
replays the hints. After 3 hours, hints are dropped β the replica
catches up via read repair or anti-entropy.
SSTables & Compaction Strategies
SSTables are immutable sorted runs of (partition_key, clustering_key, column) β value entries. A single Cassandra table on a single node typically has dozens to hundreds of SSTables. Reads must consult all of them (filtered by bloom filter and min/max key range) and merge results, with later versions winning. Compaction is the background process that merges SSTables to keep this manageable.
| Strategy | How | Best for | Pitfall |
|---|---|---|---|
| STCS (Size-Tiered) | Group ~4 similarly-sized SSTables, merge into one larger one | Write-heavy, append-only | Read amplification β many SSTables overlap key ranges |
| LCS (Leveled) | Levels L0..Ln, each ~10x the previous, non-overlapping within a level | Read-heavy, update-heavy | ~10x more compaction I/O than STCS |
| TWCS (Time-Window) | STCS within a time window; old windows dropped wholesale on TTL | Time-series, immutable + TTL | Breaks if you UPDATE old rows β defeats window grouping |
| UCS (Unified, 5.0+) | Hybrid that interpolates between STCS and LCS by load | General-purpose default going forward | Newer, less battle-tested |
Choosing wrong is one of Cassandra's classic foot-cannons. A messaging app that picked STCS will get progressively slower reads as old messages pile up. The same app on TWCS drops yesterday's window in O(1) once every row has expired.
The Read Path
Reads at QUORUM contact ceil(RF/2)+1 replicas
and merge results. Each replica's local read is multi-stage:
READ for (partition_key, clustering_key)
1. Check row cache (off-heap, default off β usually skipped)
2. Check key cache: maps partition_key β SSTable offset
3. For each candidate SSTable (filtered by min/max partition token):
a. Check bloom filter (FP rate ~1%, kept in memory)
b. If bloom positive, binary search partition summary
c. Read partition index entry from disk
d. Seek to offset, read partition data, filter by clustering_key
4. Merge results from all SSTables + memtable, latest version wins
5. Apply tombstones Read repair. When the coordinator gets responses from
multiple replicas at QUORUM, it compares them. If they
disagree (latest timestamp wins), it pushes the merged result to the
stale replicas synchronously before returning. There's also
read_repair_chance for digest-only reads β the coordinator
occasionally asks a replica to send a checksum instead of the value, and
triggers a full read+merge if it disagrees.
Tunable Consistency: R + W > N
Every read and write picks a consistency level. The system gives you
strong consistency for that operation if and only if
R + W > N, where N is the replication factor.
| CL | Acks needed | Latency | Failure tolerance |
|---|---|---|---|
| ANY | 1 (incl. hint) | Lowest | None β hint can be dropped |
| ONE | 1 replica | Low | Stale reads possible |
| QUORUM | ceil(N/2)+1 | Median | Tolerates βN/2β failures |
| LOCAL_QUORUM | QUORUM in local DC | Low (no WAN) | Recommended for multi-DC |
| EACH_QUORUM | QUORUM in every DC | WAN-bound | Strong cross-DC, expensive |
| ALL | All N replicas | Highest | Fails if any replica is down |
The most common production choice is RF=3, R=W=LOCAL_QUORUM:
strong consistency within a datacenter, no WAN dependency for local
operations, tolerates 1 node failure transparently. Cross-DC consistency
is achieved by writes propagating asynchronously between DCs (one
coordinator forwards to a coordinator in each DC).
Gossip & the Phi Failure Detector
Cassandra has no leader, no master, no central registry. Cluster membership is maintained by gossip: every second, each node picks 1-3 random peers and exchanges a state digest containing version-clocked tuples (heartbeat, status, schema version, load). After log(N) gossip rounds, news has reached the entire cluster.
Failure detection uses the Phi Accrual detector. Rather
than a fixed timeout ("dead at 5s"), each node tracks the inter-arrival
time history of gossip messages from each peer. The detector emits a
continuously-valued phi (suspicion level) β when phi exceeds
phi_convict_threshold (default 8), the node is marked DOWN.
This adapts to network conditions: a node consistently slow is given
more leeway than a node that goes silent abruptly.
Anti-Entropy Repair
Read repair and hinted handoff handle the common case, but eventually replicas can drift. Periodic repair reconciles them by computing Merkle trees per token range and exchanging hashes.
# Run on every node, every (gc_grace_seconds - safety) β e.g., weekly for default 10d grace
nodetool repair -pr keyspace1 # primary range only
# Incremental repair (4.0+) β only un-repaired SSTables
nodetool repair -inc -pr keyspace1
# How it works internally:
# 1. Each replica builds a Merkle tree over the keyspace ranges
# 2. Replicas exchange Merkle tree hashes
# 3. For ranges where hashes differ, replicas stream the data
# 4. Newest timestamps win on merge
Without periodic repair, tombstones older than gc_grace_seconds
can be purged on some replicas before they've reached others, causing
"deleted" data to resurrect (the infamous "zombie data" problem). For
this reason, Cassandra operations include scheduled repair as a hard
requirement, not a luxury.
Tradeoffs & When To Use
Cassandra vs ScyllaDB
| Cassandra | ScyllaDB | |
|---|---|---|
| Language / runtime | Java (JVM) | C++ (Seastar, no GC) |
| Threading | Thread pool, shared state | Shard-per-core, share-nothing |
| Tail latency | GC-pause spikes | Predictable Β΅s-scale |
| Throughput per node | Baseline | 3-10x typical |
| Wire protocol | CQL native | CQL native (drop-in) |
| Compaction | STCS / LCS / TWCS / UCS | STCS / LCS / TWCS / Incremental |
| Lightweight transactions | Paxos-based LWT | Same protocol, faster |
| License / model | Apache 2.0, community | Open + Enterprise paywall |
FAQ
Is Cassandra a key-value store or a column store?
Both, depending on the lens. Logically Cassandra is a wide-column store: rows are identified by a partition key plus an optional clustering key, and within a partition you can have many columns (think wide rows like a sparse spreadsheet). Physically the storage engine is more like a key-value LSM-tree where the key is (partition_key, clustering_key, column_name). Modern CQL hides most of this β you write tables and queries that look broadly relational. The wide-column heritage shows up in things like 'don't use ALLOW FILTERING' and 'put your access pattern in the partition key, not in WHERE clauses.'
How does tunable consistency actually work?
Every read and write specifies a consistency level: ONE, LOCAL_ONE, QUORUM, LOCAL_QUORUM, EACH_QUORUM, ALL, ANY. Writes wait for that many replicas to acknowledge before returning success; reads wait for that many to respond before returning the data. The math: if you set R + W > N (where N is the replication factor), you get strong consistency for that operation. So with N=3, R=W=QUORUM=2 satisfies R+W=4>3. R=ONE+W=ONE only satisfies R+W=2<3, so you can read a stale value. The 'tunable' part is that you make this choice per-statement β heavy writes can use QUORUM while less critical reads use ONE.
What is a vnode and why do they exist?
Cassandra hashes every partition key onto a 64-bit token ring. Originally each node owned exactly one contiguous range of the ring; this made bootstrap and decommission slow because all data movement was between two nodes. Vnodes (since 1.2) split each node's ownership into many small ranges (default 16 in 4.0+, was 256 in 3.x). When you add a node, it claims tokens from many existing nodes simultaneously, distributing the streaming load. A 16-node cluster with 16 vnodes each looks like 256 logical owners on the ring. The downside: each query that doesn't specify the partition key may need to talk to many vnodes, multiplying coordinator overhead.
Why doesn't Cassandra do joins?
Cassandra optimizes for predictable horizontal scale and low-latency reads. A join would mean a coordinator scattering reads across (potentially) every node in the cluster, joining in memory, and returning. That's a query model that breaks the latency contract. The Cassandra answer is: denormalize. Materialize each query's result into its own table at write time. Modern CQL supports materialized views and SAI indexes that automate some of this, but the rule of thumb stays: model queries first, schema second.
How does compaction work and why does it matter?
Writes append to the commit log + memtable, then the memtable flushes to an immutable SSTable. Over time you accumulate dozens of SSTables containing different versions of the same row. Compaction merges them, dropping tombstoned and overwritten data. There are three strategies: Size-Tiered (STCS) groups similarly-sized SSTables and merges them β best for write-heavy workloads but causes read amplification. Leveled (LCS) maintains levels where each SSTable in level L+1 is 10x larger than level L, with non-overlapping keys per level β best for read-heavy and update-heavy workloads but uses more I/O for compaction. Time-Window (TWCS) groups SSTables by write time β used for time-series with TTL, where old SSTables can drop wholesale.
When do tombstones become a problem?
Every delete or expiring TTL writes a tombstone β a marker saying 'this column is deleted.' Tombstones must outlive gc_grace_seconds (default 10 days, equal to the hinted handoff window) so they're not 'forgotten' before all replicas process them. If you read a partition with millions of tombstones, the read path has to scan all of them, sort with the live data, and apply deletions. This caused famous outages where a queue-style workload (insert + delete + read range) ground to a halt. The mitigation: avoid using Cassandra as a queue, use TWCS for TTL data, and watch the tombstone_warn_threshold + tombstone_failure_threshold settings.
How does Cassandra compare to ScyllaDB?
Scylla is a from-scratch C++ rewrite of Cassandra that keeps the protocol and data model but throws out the JVM and the threading model. Where Cassandra runs a thread pool with shared state and locks, Scylla runs one thread pinned per CPU core, with a shared-nothing architecture (Seastar framework) and asynchronous I/O. The result is 3-10x throughput per node and predictable tail latency without GC pauses. Operationally Scylla is a drop-in replacement for most Cassandra workloads. The cost: smaller community, smaller ecosystem of tools, and Scylla's enterprise features sit behind a paywall.