Kafka Internals

Kafka is, at its core, a distributed append-only log with a clever partitioning model and a battle-hardened replication protocol. Everything else — Streams, Connect, Schema Registry, KRaft — is built on that primitive. Understanding the partition (the unit of ordering, parallelism, and replication), the ISR (the durability contract), and the consumer group rebalance (the coordination algorithm) is enough to predict almost any Kafka behaviour in production. This hub assembles those primitives into a clear mental model.

Grounded in Apache Kafka 3.x, the KIP archive, and the Confluent reference architecture.

Topic, Partition, Segment, Replica

[Diagram: topic "orders" with 3 partitions and replication factor 3 across three brokers — each broker leads one partition (Broker 1: P0, Broker 2: P1, Broker 3: P2) and follows the other two. Below it, a partition log on disk as append-only segments (00000000.log for offsets 0–99,999; 00100000.log for offsets 100,000–199,999; 00200000.log active, offsets 200,000 up to the high watermark) with companion .index (offset → byte), .timeindex (timestamp → offset), and .snapshot (idempotence state) files. Finally, consumer group "payment-service": three instances mapped 1:1 to P0/P1/P2, offsets stored in __consumer_offsets, heartbeats sent to the group coordinator.]

Key Numbers

Default replication factor: 3
Default segment size: 1 GiB
Default partitions per topic: 1 (must tune)
Producer linger.ms: 0 (favours latency)
Default batch.size: 16 KB
Replica fetch quantum: 1 MB
__consumer_offsets partitions: 50

Why Kafka Exists

The Gap
By 2010, LinkedIn had hundreds of services emitting events. JMS-style queues coupled producers and consumers, scaled poorly, and didn't keep history. Hadoop was great for batch but ingestion was a mess of one-off pipelines. Nobody had a unified, replayable, high-throughput log.
The Insight
If you treat the log as the primary abstraction (not the message, not the queue), you get pub/sub, queues, change-data-capture, and event sourcing for free. Disk is fast for sequential I/O — append-only logs are exactly that. Decouple producers from consumers via offsets and you can replay history forever.
The Result
Kafka became the de facto event bus. By 2024 over 80% of Fortune 100 companies ran it. Every modern data pipeline, from CDC to feature stores, treats the topic as the integration point. The log won.
✦ Live

Consumer Groups

Partition assignment, rebalances, offset commits, and how Kafka achieves at-least-once delivery across N parallel consumers without coordinating per-message

✦ Live

Partitions & Segments

How a topic becomes ordered logs on disk, segment rotation, the index files, and why the partition is the unit of parallelism, ordering, and replication

Coming soon

Producer Internals

Batching, linger.ms, compression, idempotency, and the accumulator — what happens between send() and the broker ACK

Coming soon

Replication & ISR

Leader/follower fetch, the in-sync replica set, min.insync.replicas, and how acks=all gives you durability without paying for synchronous quorum on every write

Coming soon

Exactly-Once Semantics

Idempotent producer, transactional producer, the transaction coordinator, and why EOS is end-to-end-only with read_committed consumers

Coming soon

KRaft (Kafka Raft)

The metadata quorum that replaced ZooKeeper — how controller election, metadata replication, and log-based config storage finally collapsed into one system

Coming soon

Tiered Storage

Offloading old segments to S3/GCS while keeping the hot tail on local disk — KIP-405's economics for terabyte-retention topics

Coming soon

Kafka Streams vs Flink

Library vs framework, processor topology vs DataStream, RocksDB state stores vs managed state — when each shape wins

Topic, Partition, Segment

A topic is a logical name. It is split into a fixed number of partitions at creation time. Each partition is an ordered, immutable, append-only sequence of records identified by a monotonic 64-bit offset starting at zero. Ordering is guaranteed within a partition, never across them. The partition count is the unit of parallelism: a topic with 12 partitions can be consumed by up to 12 consumers in a single group simultaneously.
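The key→partition mapping is what makes per-key ordering work. A minimal sketch (Kafka's DefaultPartitioner actually uses murmur2; the hash below is a stand-in purely to illustrate the mapping):

```python
def hash_stand_in(key: bytes) -> int:
    # Stand-in hash, NOT Kafka's murmur2 — just deterministic and non-negative.
    h = 0
    for b in key:
        h = (h * 31 + b) & 0x7FFFFFFF
    return h

def partition_for(key: bytes, num_partitions: int) -> int:
    # Same key -> same partition -> all of that key's records stay ordered.
    return hash_stand_in(key) % num_partitions

orders = [b"user-42", b"user-7", b"user-42", b"user-99"]
assignments = [partition_for(k, 12) for k in orders]
# user-42 maps to the same partition both times, preserving its order.
assert assignments[0] == assignments[2]
```

Note the corollary: changing the partition count changes the mapping, which is why repartitioning a keyed topic breaks ordering guarantees for existing keys.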

Each partition lives on a single broker as a directory containing a sequence of segments: a .log file plus .index and .timeindex files, all named by the base offset of the segment. The active segment is the one currently being appended to; it rolls when it exceeds segment.bytes (default 1 GiB) or segment.ms (default 7 days). Closed segments are immutable and become eligible for retention deletion or tiered offload.

The .index file is a sparse mapping from offset to byte position in the log file (one entry per index.interval.bytes, default 4 KiB). Reads binary-search the index, then linearly scan a tiny window of the log. The .timeindex mirrors this for timestamp-based seeks (offsetsForTimes).
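The lookup path can be sketched as a binary search over sparse entries (hypothetical data, not the real binary .index format):

```python
import bisect

# Sparse index: (relative_offset, byte_position) pairs, roughly one entry
# per index.interval.bytes of log data.
index = [(0, 0), (210, 4096), (431, 8192), (655, 12288)]

def locate(target_offset: int) -> int:
    """Byte position to start scanning the .log file from for target_offset."""
    offsets = [off for off, _ in index]
    i = bisect.bisect_right(offsets, target_offset) - 1  # last entry <= target
    return index[i][1]

# A read for offset 300 starts at byte 4096 (the entry for offset 210) and
# then linearly scans at most ~index.interval.bytes of log.
assert locate(300) == 4096
assert locate(0) == 0
```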

The Producer: Batching, Compression, Idempotence

A Kafka producer never sends one record per network call. The send() API places the record into a per-partition RecordAccumulator queue; a background Sender thread groups ready partitions into ProduceRequests and ships them. Two parameters control the latency/throughput trade-off: linger.ms (how long to wait for more records before sending, default 0) and batch.size (maximum bytes per partition batch, default 16 KB).
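The Sender's "is this batch ready?" decision reduces to those two knobs. A simplified sketch (the real RecordAccumulator also considers retries, memory pressure, and in-flight limits):

```python
import time

def batch_ready(batch_bytes: int, batch_created_at: float,
                batch_size: int = 16_384, linger_ms: int = 0) -> bool:
    # A batch ships when it is full OR its linger window has elapsed.
    full = batch_bytes >= batch_size
    expired = (time.time() - batch_created_at) * 1000 >= linger_ms
    return full or expired

# With the defaults (linger.ms=0) every batch is "expired" immediately:
assert batch_ready(100, time.time())
# With linger.ms=50, a small fresh batch waits for more records:
assert not batch_ready(100, time.time(), linger_ms=50)
```

Raising linger.ms to 5–20 ms is the usual first move for throughput: it trades a few milliseconds of latency for much larger, better-compressing batches.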

Compression is applied at the batch level, not the record level. Kafka supports gzip, snappy, lz4, and zstd. Brokers store the compressed batch as-is and pass it through to consumers — the broker doesn't recompress. This is one of Kafka's killer optimisations: compression becomes free for the broker and 5–10× cheaper for the network.

The idempotent producer (enable.idempotence=true, default since 3.0) attaches a producer ID + sequence number per partition. The broker dedupes on retry, eliminating duplicates from transient network failures. This costs nothing perceptible and is strictly better than not having it.
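The broker-side dedupe can be sketched as a per-partition map from producer ID to last accepted sequence (a hypothetical simplification; the real check operates on batch sequence ranges):

```python
class PartitionLog:
    def __init__(self):
        self.records = []
        self.last_seq = {}  # producer_id -> last accepted sequence number

    def append(self, producer_id: int, seq: int, value: str) -> bool:
        if self.last_seq.get(producer_id, -1) >= seq:
            return False  # duplicate retry: ack it, but don't re-append
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return True

log = PartitionLog()
log.append(producer_id=1, seq=0, value="order-created")
log.append(producer_id=1, seq=0, value="order-created")  # network retry
assert log.records == ["order-created"]  # no duplicate in the log
```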

Replication and the ISR

Each partition has one leader and replication.factor - 1 followers on other brokers. Producers always write to the leader; consumers also read from the leader by default (KIP-392 allows rack-aware reads from followers). Followers fetch from the leader using a regular FetchRequest — the same one consumers use, just authenticated as a replica.

The in-sync replica set (ISR) is the set of replicas (including the leader) that have caught up to within replica.lag.time.max.ms (default 30 s) of the leader's log end. The leader maintains the ISR and writes a "high watermark" — the minimum offset replicated by all ISR members — and only exposes records up to the high watermark to consumers.
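The high-watermark rule can be written in one line: it is the minimum log-end offset across the ISR, so consumers never see a record that could vanish on leader failover. A sketch with illustrative broker names:

```python
def high_watermark(log_end_offsets: dict[str, int], isr: set[str]) -> int:
    # Everything below the minimum ISR log-end offset is safely replicated.
    return min(log_end_offsets[r] for r in isr)

leo = {"broker-1": 1000, "broker-2": 997, "broker-3": 400}
# broker-3 has fallen behind and dropped out of the ISR:
hw = high_watermark(leo, isr={"broker-1", "broker-2"})
assert hw == 997  # consumers can read up to 997, not the leader's 1000
```

Note how shrinking the ISR advances the high watermark: dropping the laggard broker-3 is what lets offsets 401–997 become readable.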

Producers choose durability via acks: 0 (fire and forget), 1 (leader ack only — losing leader before replication = data loss), all (leader waits for the ISR to replicate). With acks=all + min.insync.replicas=2 + replication factor 3, you can lose any one broker without losing data and without losing availability — the durability sweet spot for production.
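That sweet spot translates into a small set of settings. A sketch in confluent-kafka-style config (the keys are real Kafka config names; broker addresses are illustrative):

```python
producer_config = {
    "bootstrap.servers": "broker1:9092,broker2:9092,broker3:9092",
    "acks": "all",               # leader waits for the full ISR
    "enable.idempotence": True,  # no duplicates on producer retry
}
# Broker/topic side (server.properties or per-topic overrides):
#   replication.factor=3
#   min.insync.replicas=2            # writes need leader + 1 follower
#   unclean.leader.election.enable=false
```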

unclean.leader.election.enable is the safety vs availability dial. With false (default since 1.0), if all ISR replicas die, the partition becomes unavailable rather than electing a lagging out-of-sync replica that would lose data. Never set this to true unless you understand the data-loss implications.

Consumer Groups and Rebalances

A consumer group is a set of consumers that cooperatively process all partitions of subscribed topics. Each partition is assigned to exactly one consumer in the group; if you have more consumers than partitions, the excess sit idle. Group membership is tracked by the group coordinator — a broker assigned by hashing the group ID into the 50 partitions of the internal __consumer_offsets topic.
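Coordinator selection is deterministic, which is how every member independently finds the same broker. A sketch reproducing the Java String hash plus the 31-bit mask Kafka applies (for illustration; the real code lives in the group metadata manager):

```python
def coordinator_partition(group_id: str, num_partitions: int = 50) -> int:
    h = 0
    for ch in group_id:                       # java.lang.String.hashCode
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return (h & 0x7FFFFFFF) % num_partitions  # mask to non-negative

# Every consumer in "payment-service" computes the same partition, hence
# contacts the same coordinator: the leader of that __consumer_offsets
# partition.
p = coordinator_partition("payment-service")
assert 0 <= p < 50
```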

When a member joins, leaves, or fails to heartbeat within session.timeout.ms, the coordinator triggers a rebalance. The classic eager protocol stops the world: every consumer revokes all partitions, the coordinator reassigns, then consumers fetch again. Since 2.4, the CooperativeStickyAssignor enables incremental cooperative rebalancing — only the affected partitions move, eliminating most stop-the-world latency. Kafka Streams uses it by default; plain consumers opt in via partition.assignment.strategy.
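The difference between the two protocols is just the revocation set. A sketch with illustrative consumer names:

```python
def cooperative_revocations(old: dict[str, set], new: dict[str, set]) -> dict[str, set]:
    # Cooperative: each member revokes only partitions it is losing.
    return {c: old.get(c, set()) - new.get(c, set()) for c in old}

old = {"c1": {0, 1, 2}, "c2": {3, 4, 5}}
# c3 joins; a sticky assignor moves one partition from each existing member:
new = {"c1": {0, 1}, "c2": {3, 4}, "c3": {2, 5}}
revoked = cooperative_revocations(old, new)
assert revoked == {"c1": {2}, "c2": {5}}  # eager would have revoked all six
```

Partitions 0, 1, 3, and 4 keep flowing throughout the rebalance; under the eager protocol the entire group pauses.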

Offsets are committed back to __consumer_offsets automatically (every auto.commit.interval.ms) or manually (commitSync()/commitAsync()). At-least-once is the default: commit after processing. Exactly-once requires the transactional producer pattern below.

Exactly-Once Semantics

Kafka's EOS is two pieces. Idempotent producer guarantees no duplicates within a single session/topic-partition. Transactions let a producer atomically write to multiple partitions and commit consumer offsets in the same atomic unit, gated by a transaction coordinator on the broker side and the internal __transaction_state topic.

The classical "consume-process-produce" pattern: a Streams app reads from input topic, processes, writes to output topic, and commits the input offset — all transactionally. Consumers using isolation.level=read_committed filter out aborted transactions, preserving end-to-end exactly-once. The cost is meaningful: a few hundred microseconds of extra latency per commit, plus coordinator overhead.
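The atomicity contract can be shown with a toy simulation — output records and the input-offset commit become visible together or not at all. This is illustrative only; in the real system the broker-side transaction coordinator and commit markers do this work:

```python
class Transaction:
    def __init__(self, output_log: list, committed_offsets: dict):
        self.output_log, self.committed = output_log, committed_offsets
        self.pending, self.pending_offsets = [], {}

    def send(self, record):
        self.pending.append(record)            # buffered, not yet visible

    def send_offsets(self, tp, offset):
        self.pending_offsets[tp] = offset      # input offset rides the txn

    def commit(self):
        self.output_log.extend(self.pending)   # outputs + offsets together
        self.committed.update(self.pending_offsets)

    def abort(self):
        self.pending, self.pending_offsets = [], {}  # nothing becomes visible

out, offsets = [], {}
txn = Transaction(out, offsets)
txn.send("enriched-order")
txn.send_offsets(("orders", 0), 101)
txn.commit()
assert out == ["enriched-order"] and offsets[("orders", 0)] == 101

txn2 = Transaction(out, offsets)
txn2.send("bad-record")
txn2.abort()  # a read_committed consumer never sees "bad-record"
assert out == ["enriched-order"]
```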

Critically, EOS only holds within Kafka. A side effect (e.g., HTTP call to a payment processor) cannot be made transactional with the offset commit. For "exactly-once" semantics that include external side-effects, you need an idempotency key on the side-effect plus the Kafka transaction.

KRaft: Goodbye, ZooKeeper

For a decade, Kafka used ZooKeeper to store cluster metadata (topics, partitions, ISR, ACLs). It worked, but ZooKeeper had a different operational model, separate failure modes, and a hard ceiling around 200k partitions per cluster. KRaft (KIP-500, GA in 3.3) replaces ZooKeeper with a Raft-based metadata quorum running inside Kafka brokers themselves. The metadata is a regular Kafka topic (__cluster_metadata) replicated via Raft across a small set of controller nodes.

Operational consequences: one system to operate, faster controller failover (seconds instead of tens of seconds), and the ability to scale to millions of partitions per cluster. As of Kafka 4.0 (2025), KRaft is the only supported mode — ZooKeeper is gone.

Tiered Storage

KIP-405 (production-ready in 3.6) lets brokers offload closed segments to object storage (S3, GCS, Azure Blob) while keeping the hot tail on local disk. Consumers reading old data fetch transparently from the remote tier — the broker streams it through. This collapses Kafka's traditional cost wall: previously, retaining 30 days of a 100 MB/s topic meant roughly 260 TB per replica (close to 780 TB across three replicas) of expensive SSD; with tiered storage it requires a few hours of SSD plus cheap S3.
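The retention arithmetic, worked out in decimal units:

```python
rate_bytes_per_s = 100 * 10**6   # 100 MB/s ingest
retention_s = 30 * 24 * 3600     # 30 days
replicas = 3

per_replica_tb = rate_bytes_per_s * retention_s / 10**12
total_tb = per_replica_tb * replicas
assert round(per_replica_tb) == 259   # ~260 TB per copy
assert round(total_tb) == 778         # ~780 TB of SSD without tiering
```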

The tradeoff is read latency for cold data (S3 round-trip, hundreds of ms vs sub-millisecond local). Acceptable for replay/CDC backfill workloads; not for low-latency consumers reading the tail.

Streams, Connect, Schema Registry

Kafka Streams is a JVM library that turns a Kafka cluster into a stream processor. Topology = source(s) → processors → sink(s). State stores back KTable joins and aggregations using local RocksDB plus changelog topics for durability — every state mutation is mirrored to a Kafka topic so a restarted instance can rebuild state by replaying the changelog.

Kafka Connect is a separate process pool for source/sink connectors (Debezium, JDBC, S3, Elasticsearch). It uses Kafka topics for offset and config storage and provides exactly-once semantics for compatible connectors via the transactional producer.

Schema Registry (Confluent open-source, plus Apicurio and Karapace) is an external service that stores Avro/Protobuf/JSON-Schema definitions. Producers register schemas and prefix each message payload with a magic byte and the 4-byte schema ID; consumers fetch the schema by ID. It enforces compatibility rules (backward, forward, full) so schemas can evolve without breaking consumers.

Tradeoffs and When Not To Use Kafka

Kafka is heavy. A production cluster needs three+ brokers, schema registry, monitoring, and an operations team that understands ISR dynamics. The 99th-percentile latency floor is around 5–10 ms per produce round-trip with acks=all — acceptable for streaming, painful for request-response.

Don't use Kafka as a task queue (use SQS, RabbitMQ, or Redis Streams — they have proper visibility timeouts and per-message ack semantics). Don't use it for low-throughput pub/sub (NATS or even a database trigger is cheaper). Do use it for CDC, event sourcing, durable event integration, log aggregation, and stream processing pipelines — the workloads it was designed for.

Kafka vs Alternative Streaming Systems

| | Kafka | Redpanda | Pulsar | NATS JetStream |
|---|---|---|---|---|
| Implementation | JVM (Scala/Java) | C++, Raft, no JVM | Java + BookKeeper | Go |
| Wire protocol | Kafka protocol | Kafka-compatible | Custom (Pulsar protocol) | NATS protocol |
| Storage architecture | Per-broker partitioned log | Per-broker partitioned log | Decoupled compute/storage | Per-stream log files |
| Coordination | KRaft (Raft) | Raft (per-partition) | ZooKeeper or etcd | JetStream Raft |
| Tiered storage | KIP-405 (3.6+) | Yes (S3) | Native (BookKeeper offload) | Yes |
| Best for | The default. Massive ecosystem. | Lower latency, less ops | Geo-replication, multi-tenant | Edge, IoT, lighter workloads |
| Per-message ack semantics | No (offset-based) | No (offset-based) | Yes (cumulative + individual) | Yes |

FAQ

How many partitions should my topic have?

At least your peak throughput divided by the throughput one consumer can sustain, plus headroom for growth, plus consideration for rebalance cost (more partitions historically meant longer rebalances, mitigated by cooperative rebalancing). Don't over-partition: each partition carries fixed file-handle and memory cost. 12–48 is common for medium throughput; 100+ is high-throughput; 1000+ requires careful planning.
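That sizing rule as a sketch, with illustrative numbers:

```python
import math

def partitions_needed(peak_mb_s: float, per_consumer_mb_s: float,
                      headroom: float = 2.0) -> int:
    # partitions >= peak / per-consumer throughput, times a growth factor.
    return math.ceil(peak_mb_s / per_consumer_mb_s * headroom)

# 60 MB/s peak, each consumer sustains ~10 MB/s, 2x headroom for growth:
assert partitions_needed(60, 10) == 12
```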

What does acks=all actually guarantee?

That the leader has written your record to its local log AND every member of the in-sync replica set has fetched and persisted it. Combined with min.insync.replicas=2 and replication factor 3, you can lose any single broker without data loss or write rejection. With min.insync.replicas=replication.factor, you trade availability for stronger durability — losing one broker rejects writes.

Why are my consumers rebalancing constantly?

Almost always one of: (a) consumer is taking longer than max.poll.interval.ms between polls — heartbeat thread keeps the session alive but the coordinator kicks them on poll timeout, (b) session.timeout.ms is shorter than realistic processing time, (c) deploys without using static membership (group.instance.id), causing a rebalance per pod restart. Static membership + cooperative rebalancing eliminate most pain.

Should I use Kafka Streams or Flink?

Streams if you want a library deployed as part of your service (no separate cluster), don't need windowing beyond a few minutes, and your state fits in local RocksDB. Flink if you need: high-throughput stateful processing with state in TB, complex windowing, exactly-once with non-Kafka sinks, or a true clustered execution model with checkpointing. Both are great. Don't pick Spark Streaming for new work in 2026.

How does compaction (log compaction, not RocksDB) work?

Compacted topics retain only the latest value per key, asynchronously. Used for change-data-capture and KTables: the topic becomes a queryable changelog where the latest record per key is the current state. Old segments are rewritten by a compaction thread, dropping superseded keys. Tombstones (null value) signal deletion and are retained for delete.retention.ms.

Is Redpanda just "Kafka without the JVM"?

Mostly yes for the API surface — wire-compatible, drop-in for most clients. Internally it's a per-core thread-per-shard architecture using Seastar, Raft per partition (no separate ISR concept), and tighter integration of metadata. Operational benefits: lower memory, simpler deployment, no GC tail latency. Tradeoff: smaller ecosystem, fewer connectors, and you're on Redpanda's release cadence, not Apache's.

What replaced ZooKeeper exactly?

An internal Kafka topic __cluster_metadata replicated via Raft across a small set of controller-mode brokers. Three to five controller nodes form the metadata quorum; data brokers fetch metadata as a standard Kafka log. The KRaft mode is the only supported mode in Kafka 4.0+. ZooKeeper is no longer required, supported, or recommended.