Kafka Internals
Kafka is, at its core, a distributed append-only log with a clever partitioning model and a battle-hardened replication protocol. Everything else — Streams, Connect, Schema Registry, KRaft — is built on that primitive. Understanding the partition (the unit of ordering, parallelism, and replication), the ISR (the durability contract), and the consumer group rebalance (the coordination algorithm) is enough to predict almost any Kafka behaviour in production. This hub assembles those primitives into a clear mental model.
Grounded in Apache Kafka 3.x, the KIP archive, and the Confluent reference architecture.
Consumer Groups
Partition assignment, rebalances, offset commits, and how Kafka achieves at-least-once delivery across N parallel consumers without coordinating per-message
Partitions & Segments
How a topic becomes ordered logs on disk, segment rotation, the index files, and why the partition is the unit of parallelism, ordering, and replication
Producer Internals
Batching, linger.ms, compression, idempotency, and the accumulator — what happens between send() and the broker ACK
Replication & ISR
Leader/follower fetch, the in-sync replica set, min.insync.replicas, and how acks=all gives you durability without paying for synchronous quorum on every write
Exactly-Once Semantics
Idempotent producer, transactional producer, the transaction coordinator, and why EOS is end-to-end-only with read_committed consumers
KRaft (Kafka Raft)
The metadata quorum that replaced ZooKeeper — how controller election, metadata replication, and log-based config storage finally collapsed into one system
Tiered Storage
Offloading old segments to S3/GCS while keeping the hot tail on local disk — KIP-405's economics for terabyte-retention topics
Kafka Streams vs Flink
Library vs framework, processor topology vs DataStream, RocksDB state stores vs managed state — when each shape wins
Topic, Partition, Segment
A topic is a logical name. It is split into a fixed number of partitions at creation time. Each partition is an ordered, immutable, append-only sequence of records identified by a monotonic 64-bit offset starting at zero. Ordering is guaranteed within a partition, never across them. The partition is the unit of parallelism, and the partition count caps it: a topic with 12 partitions can be consumed by at most 12 consumers in a single group simultaneously.
On disk, each partition replica is a directory on its broker containing a sequence of segments:
a .log file plus .index and .timeindex files, all named by the
base offset of the segment.
The active segment is the one currently being appended to; it rolls when it exceeds segment.bytes
(default 1 GiB) or segment.ms (default 7 days). Closed segments are immutable and become eligible
for retention deletion or tiered offload.
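Laid out on disk, a partition directory looks roughly like this (hypothetical topic orders, partition 3, with made-up offsets):

```
orders-3/
  00000000000000000000.log         # closed segment, base offset 0
  00000000000000000000.index
  00000000000000000000.timeindex
  00000000000052890123.log         # active segment, base offset 52890123
  00000000000052890123.index
  00000000000052890123.timeindex
  leader-epoch-checkpoint
```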
The .index file is a sparse mapping from offset to byte position in the log file (one entry per
index.interval.bytes, default 4 KiB). Reads binary-search the index, then linearly scan a tiny
window of the log. The .timeindex mirrors this for timestamp-based seeks (offsetsForTimes).
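As a concrete use of the time index, here is a minimal sketch (not from the Kafka docs) that seeks a consumer's already-assigned partitions to the first offset at or after a timestamp; the broker answers offsetsForTimes from the .timeindex:

```java
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

class TimestampSeek {
    // Assumes the consumer already has an assignment (after poll() or assign()).
    static void seekTo(KafkaConsumer<String, String> consumer, long timestampMs) {
        Map<TopicPartition, Long> query = consumer.assignment().stream()
                .collect(Collectors.toMap(tp -> tp, tp -> timestampMs));

        // One entry per partition: earliest offset whose timestamp is >= timestampMs.
        Map<TopicPartition, OffsetAndTimestamp> result = consumer.offsetsForTimes(query);

        result.forEach((tp, oat) -> {
            if (oat != null) {                     // null: no record at or after that timestamp
                consumer.seek(tp, oat.offset());
            }
        });
    }
}
```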
The Producer: Batching, Compression, Idempotence
A Kafka producer never sends one record per network call. The send() API places the record into
a per-partition RecordAccumulator queue. A background Sender thread groups
ready partitions into ProduceRequests and ships them. Two parameters control the latency/throughput
trade-off: linger.ms (how long to wait for more records before sending, default 0) and
batch.size (max bytes per partition batch, default 16 KB).
Compression is applied at the batch level, not the record level. Kafka supports gzip,
snappy, lz4, and zstd. Brokers store the compressed batch as-is and
pass it through to consumers — the broker doesn't recompress. This is one of Kafka's killer optimisations:
compression becomes free for the broker and 5–10× cheaper for the network.
The idempotent producer (enable.idempotence=true, default since 3.0) attaches
a producer ID + sequence number per partition. The broker dedupes on retry, eliminating duplicates from
transient network failures. This costs nothing perceptible and is strictly better than not having it.
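Putting those knobs together, a tuned producer might look like the following sketch; the broker address, topic, and specific values are illustrative, not recommendations:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");            // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);            // wait up to 10 ms to fill a batch
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);        // 64 KiB per-partition batches
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "zstd"); // compressed once, per batch
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true); // PID + sequence dedupe on retry
        props.put(ProducerConfig.ACKS_CONFIG, "all");              // wait for the full ISR

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-123", "{\"total\":42}"));
        }
    }
}
```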
Replication and the ISR
Each partition has one leader and replication.factor - 1 followers
on other brokers. Producers always write to the leader, and consumers read from it by default (follower
fetching is an opt-in optimisation, KIP-392). Followers replicate by issuing the same FetchRequest
consumers use, identifying themselves with a replica ID.
The in-sync replica set (ISR) is the set of replicas (including the leader) that have
caught up to within replica.lag.time.max.ms (default 30 s) of the leader's log end. The
leader maintains the ISR and writes a "high watermark" — the minimum offset replicated by all ISR members
— and only exposes records up to the high watermark to consumers.
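A made-up snapshot shows the rule:

```
leader log end offset     : 1000   (records 0..999 written locally)
follower B log end offset :  998   (in ISR)
follower C log end offset :  995   (in ISR)
high watermark            :  995   → consumers can read offsets 0..994;
                                     995..999 exist on the leader but are not yet visible
```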
Producers choose durability via acks: 0 (fire and forget), 1 (leader
ack only — losing leader before replication = data loss), all (leader waits for the ISR to
replicate). With acks=all + min.insync.replicas=2 + replication factor 3, you
can lose any one broker without losing data and without losing availability — the durability sweet spot
for production.
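Expressed as topic creation through the AdminClient, that sweet spot looks like the following sketch (topic name and bootstrap address are placeholders):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateOrdersTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            NewTopic topic = new NewTopic("orders", 12, (short) 3)   // 12 partitions, RF 3
                    .configs(Map.of("min.insync.replicas", "2"));    // acks=all waits for >= 2 replicas
            admin.createTopics(List.of(topic)).all().get();          // block until the controller applies it
        }
    }
}
```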
unclean.leader.election.enable is the safety vs availability dial. With false
(the default since 0.11), if all ISR replicas die, the partition becomes unavailable rather than electing a
lagging out-of-sync replica that would lose data. Never set this to true unless you understand the
data-loss implications.
Consumer Groups and Rebalances
A consumer group is a set of consumers that cooperatively process all partitions of subscribed topics.
Each partition is assigned to exactly one consumer in the group; if you have more consumers than
partitions, the excess sit idle. Group membership is tracked by the group coordinator —
a broker assigned by hashing the group ID into the 50 partitions of the internal
__consumer_offsets topic.
When a member joins, leaves, or fails to heartbeat in session.timeout.ms, the coordinator
triggers a rebalance. The classic eager protocol stops the world: every consumer revokes
all partitions, the coordinator reassigns, then consumers fetch again. Cooperative incremental
rebalancing (added in 2.4 via the CooperativeStickyAssignor; opt-in for plain consumers, the
default in Kafka Streams) reassigns only the affected partitions, eliminating most of the stop-the-world pause.
Offsets are committed back to __consumer_offsets automatically (every
auto.commit.interval.ms) or manually (commitSync()/commitAsync()).
At-least-once is the default: commit after processing. Exactly-once requires the transactional producer
pattern below.
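As a sketch of the at-least-once pattern with manual commits and cooperative assignment opted in (topic, group ID, and broker address are placeholders):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");      // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-indexer");             // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");            // commit manually, after processing
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                  CooperativeStickyAssignor.class.getName());                    // incremental rebalances

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);                  // your side effect; must tolerate replays
                }
                consumer.commitSync();                // commit only after processing => at-least-once
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) { /* ... */ }
}
```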
Exactly-Once Semantics
Kafka's EOS is two pieces. Idempotent producer guarantees no duplicates within a single
session/topic-partition. Transactions let a producer atomically write to multiple
partitions and commit consumer offsets in the same atomic unit, gated by a transaction
coordinator on the broker side and the internal __transaction_state topic.
The classical "consume-process-produce" pattern: a Streams app reads from input topic, processes, writes
to output topic, and commits the input offset — all transactionally. Consumers using
isolation.level=read_committed filter out aborted transactions, preserving end-to-end
exactly-once. The cost is meaningful: a few hundred microseconds of extra latency per commit, plus
coordinator overhead.
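A sketch of that pattern with the plain clients (error handling elided; topic name and processing are placeholders; the consumer is assumed to use enable.auto.commit=false and isolation.level=read_committed, the producer a transactional.id):

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

class ConsumeProcessProduce {
    static void run(KafkaConsumer<String, String> consumer, KafkaProducer<String, String> producer) {
        producer.initTransactions();                        // registers with the coordinator, fences zombies
        while (true) {
            ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
            if (batch.isEmpty()) continue;

            producer.beginTransaction();
            Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
            for (ConsumerRecord<String, String> rec : batch) {
                producer.send(new ProducerRecord<>("output-topic", rec.key(), transform(rec.value())));
                offsets.put(new TopicPartition(rec.topic(), rec.partition()),
                            new OffsetAndMetadata(rec.offset() + 1));   // next offset to read
            }
            // Input offsets are committed inside the same transaction as the output writes.
            producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
            producer.commitTransaction();
        }
    }

    static String transform(String value) { return value; } // placeholder processing
}
```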
Critically, EOS only holds within Kafka. A side effect (e.g., HTTP call to a payment processor) cannot be made transactional with the offset commit. For "exactly-once" semantics that include external side-effects, you need an idempotency key on the side-effect plus the Kafka transaction.
KRaft: Goodbye, ZooKeeper
For a decade, Kafka used ZooKeeper to store cluster metadata (topics, partitions, ISR, ACLs). It worked,
but ZooKeeper had a different operational model, separate failure modes, and a hard ceiling around 200k
partitions per cluster. KRaft (KIP-500, GA in 3.3) replaces ZooKeeper with a Raft-based
metadata quorum running inside Kafka brokers themselves. The metadata is a regular Kafka topic
(__cluster_metadata) replicated via Raft across a small set of controller nodes.
Operational consequences: one system to operate, faster controller failover (seconds instead of tens of seconds), and the ability to scale to millions of partitions per cluster. As of Kafka 4.0 (2025), KRaft is the only supported mode — ZooKeeper is gone.
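The deployment shape, roughly, for a combined broker-plus-controller node; host names and IDs are illustrative, and real clusters typically run three or five dedicated controller nodes:

```properties
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@kafka-1:9093,2@kafka-2:9093,3@kafka-3:9093
listeners=PLAINTEXT://kafka-1:9092,CONTROLLER://kafka-1:9093
controller.listener.names=CONTROLLER
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT
log.dirs=/var/lib/kafka/data
```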
Tiered Storage
KIP-405 (early access in 3.6, production-ready in 3.9) lets brokers offload closed segments to object storage (S3, GCS, Azure Blob) while keeping the hot tail on local disk. Consumers reading old data fetch transparently from the remote tier — the broker streams it through. This collapses Kafka's traditional cost wall: previously, retaining 30 days of a 100 MB/s topic meant roughly 260 TB of data, around 780 TB of expensive local SSD at replication factor 3; with tiered storage it requires a few hours of SSD plus cheap S3.
The tradeoff is read latency for cold data (S3 round-trip, hundreds of ms vs sub-millisecond local). Acceptable for replay/CDC backfill workloads; not for low-latency consumers reading the tail.
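The configuration shape, using the Apache-level settings (the RemoteStorageManager plugin and its own settings are deployment-specific and omitted; values are illustrative):

```properties
# broker
remote.log.storage.system.enable=true

# topic
remote.storage.enable=true
retention.ms=2592000000        # 30 days total, local + remote
local.retention.ms=14400000    # keep ~4 hours of the tail on local disk; older segments live only remotely
```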
Streams, Connect, Schema Registry
Kafka Streams is a JVM library that turns a Kafka cluster into a stream processor. Topology = source(s) → processors → sink(s). State stores back KTable joins and aggregations using local RocksDB plus changelog topics for durability — every state mutation is mirrored to a Kafka topic so a restarted instance can rebuild state by replaying the changelog.
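A minimal topology sketch: count records per key from an input topic into an output topic. The application ID, topic names, and serdes are placeholders; the count() is what creates the RocksDB store and its changelog topic.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class OrderCountsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-counts");          // also the consumer group ID
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");       // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");               // source
        KTable<String, Long> countsByKey = orders
                .groupByKey()                                                    // key unchanged, so no repartition
                .count();                                                        // RocksDB store + changelog topic

        countsByKey.toStream()
                .to("order-counts", Produced.with(Serdes.String(), Serdes.Long())); // sink

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```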
Kafka Connect is a separate process pool for source/sink connectors (Debezium, JDBC, S3, Elasticsearch). It uses Kafka topics for offset and config storage and provides exactly-once semantics for compatible connectors via the transactional producer.
Schema Registry (Confluent open-source, plus Apicurio, Karapace) is an external service that stores Avro/Protobuf/JSON-Schema definitions. Producers register schemas and prefix each message payload with a magic byte plus the 4-byte schema ID; consumers look the schema up by that ID. It enforces compatibility rules (backward, forward, full) so schemas can evolve without breaking consumers.
Tradeoffs and When Not To Use Kafka
Kafka is heavy. A production cluster needs three+ brokers, schema registry, monitoring, and an operations
team that understands ISR dynamics. The 99th-percentile latency floor is around 5–10 ms per produce
round-trip with acks=all — acceptable for streaming, painful for request-response.
Don't use Kafka as a task queue (use SQS, RabbitMQ, or Redis Streams — they have proper visibility timeouts and per-message ack semantics). Don't use it for low-throughput pub/sub (NATS or even a database trigger is cheaper). Do use it for CDC, event sourcing, durable event integration, log aggregation, and stream processing pipelines — the workloads it was designed for.
Kafka vs Alternative Streaming Systems
| | Kafka | RedPanda | Pulsar | NATS JetStream |
|---|---|---|---|---|
| Implementation | JVM (Scala/Java) | C++, Raft, no JVM | Java + BookKeeper | Go |
| Wire protocol | Kafka protocol | Kafka-compatible | Custom (Pulsar protocol) | NATS protocol |
| Storage architecture | Per-broker partitioned log | Per-broker partitioned log | Decoupled compute/storage | Per-stream log files |
| Coordination | KRaft (Raft) | Raft (per-partition) | ZooKeeper or etcd | JetStream Raft |
| Tiered storage | KIP-405 (3.6+) | Yes (S3) | Native (BookKeeper offload) | Yes |
| Best for | The default. Massive ecosystem. | Lower latency, less ops | Geo-replication, multi-tenant | Edge, IoT, lighter workloads |
| Per-message ack semantics | No (offset-based) | No (offset-based) | Yes (cumulative + individual) | Yes |
FAQ
How many partitions should my topic have?
Enough that your peak topic throughput, divided by the throughput one consumer can sustain, is at most the partition count, plus headroom for growth and allowance for rebalance cost (more partitions historically meant longer rebalances, mitigated by cooperative rebalancing). Don't over-partition: each partition has a fixed file-handle and memory cost. 12–48 is common for medium throughput; 100+ is high-throughput; 1000+ requires careful planning.
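A back-of-the-envelope version, with made-up numbers:

```
peak topic throughput          : 300 MB/s
sustainable rate per consumer  :  25 MB/s
minimum partitions             : 300 / 25 = 12
with ~2x headroom for growth   : 24 partitions
```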
What does acks=all actually guarantee?
That the leader has written your record to its local log AND every member of the in-sync replica set has fetched and persisted it. Combined with min.insync.replicas=2 and replication factor 3, you can lose any single broker without data loss or write rejection. With min.insync.replicas=replication.factor, you trade availability for stronger durability — losing one broker rejects writes.
Why are my consumers rebalancing constantly?
Almost always one of: (a) the consumer takes longer than max.poll.interval.ms between polls, so the heartbeat thread keeps the session alive but the consumer leaves the group when the poll deadline passes, (b) session.timeout.ms is shorter than realistic processing time, (c) deploys without static membership (group.instance.id), causing a rebalance per pod restart. Static membership + cooperative rebalancing eliminate most of the pain, as in the config sketch below.
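In consumer-config terms, the usual fix looks something like this (values are illustrative, not recommendations):

```properties
max.poll.interval.ms=600000         # allow up to 10 minutes between polls for slow batches
session.timeout.ms=45000            # liveness via the heartbeat thread, independent of processing time
group.instance.id=orders-indexer-0  # static membership: a restart within the timeout is not a rebalance
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
```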
Should I use Kafka Streams or Flink?
Streams if you want a library deployed as part of your service (no separate cluster), don't need windowing beyond a few minutes, and your state fits in local RocksDB. Flink if you need: high-throughput stateful processing with state in TB, complex windowing, exactly-once with non-Kafka sinks, or a true clustered execution model with checkpointing. Both are great. Don't pick Spark Streaming for new work in 2026.
How does compaction (log compaction, not RocksDB) work?
Compacted topics retain only the latest value per key, asynchronously. Used for change-data-capture and KTables: the topic becomes a queryable changelog where the latest record per key is the current state. Old segments are rewritten by a compaction thread, dropping superseded keys. Tombstones (null value) signal deletion and are retained for delete.retention.ms.
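As per-topic configuration (values illustrative):

```properties
cleanup.policy=compact              # or "compact,delete" to also enforce time/size retention
min.cleanable.dirty.ratio=0.5       # fraction of uncompacted log that triggers a compaction pass
delete.retention.ms=86400000        # keep tombstones 24 h so lagging consumers still see the delete
```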
Is RedPanda just "Kafka without the JVM"?
Mostly yes for the API surface — wire-compatible, drop-in for most clients. Internally it's a per-core thread-per-shard architecture using Seastar, Raft per partition (no separate ISR concept), and tighter integration of metadata. Operational benefits: lower memory, simpler deployment, less GC tail latency. Tradeoff: smaller ecosystem, fewer connectors, you're on RedPanda's release cadence not Apache's.
What replaced ZooKeeper exactly?
An internal Kafka topic __cluster_metadata replicated via Raft across a small set of controller-mode brokers. Three to five controller nodes form the metadata quorum; data brokers fetch metadata as a standard Kafka log. The KRaft mode is the only supported mode in Kafka 4.0+. ZooKeeper is no longer required, supported, or recommended.