Partitions & Offsets
The Anatomy of a Kafka Topic, and How to Choose Partition Counts
A Kafka topic is split into partitions: ordered, append-only logs. Each message gets an offset (a sequence number) within its partition. Producers choose a partition by hashing the message key; messages without a key are spread across partitions (round-robin, or sticky batching in newer clients). A consumer group divides the partitions among its members, so within a group each partition is read by exactly one consumer.
Producer: How Messages Land in Partitions
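A minimal sketch of key-based partition selection, in plain Python. The function name `partition_for` is hypothetical, and CRC32 is only a stand-in hash: Kafka's Java client hashes keys with murmur2, so the partition numbers here will not match a real cluster. The point is the shape of the logic: hash the key, take it modulo the partition count, and rotate when there is no key.

```python
import itertools
import zlib

def partition_for(key, num_partitions, rr_counter):
    """Pick a partition: hash the key if present, else rotate round-robin."""
    if key is None:
        # Keyless messages: spread evenly across partitions.
        return next(rr_counter) % num_partitions
    # Stand-in hash (CRC32); Kafka's Java client uses murmur2 instead.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

counter = itertools.count()
a = partition_for("user-42", 6, counter)
b = partition_for("user-42", 6, counter)
keyless = [partition_for(None, 6, counter) for _ in range(6)]

print(a == b)            # the same key maps to the same partition
print(sorted(keyless))   # keyless messages cover every partition
```

Because the partition is a pure function of the key and the partition count, ordering per key holds only while the partition count stays fixed.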
Consumer Groups: Partition Assignment
Within a consumer group, each partition is assigned to exactly one consumer. If there are more consumers than partitions, the extra consumers sit idle.
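To make the idle-consumer case concrete, here is a toy round-robin assignment. It is an illustration only, not Kafka's actual assignor code, and `assign_round_robin` is a hypothetical name.

```python
def assign_round_robin(partitions, consumers):
    """Deal partitions out to consumers one at a time."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions, 5 consumers: two consumers get nothing to read.
assignment = assign_round_robin([0, 1, 2], ["c1", "c2", "c3", "c4", "c5"])
idle = [c for c, parts in assignment.items() if not parts]

print(assignment)
print("idle:", idle)
```

Adding consumers beyond the partition count adds standby capacity, not throughput.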
How Many Partitions?
Rule of thumb: partitions = target throughput ÷ per-partition throughput. If you need 100 MB/s and each partition handles ~10 MB/s, use 10 partitions. More partitions mean more parallelism, but also more overhead (file handles, replication traffic, leader elections).
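The rule of thumb as arithmetic, rounding up so a fractional result never leaves you short. `partitions_needed` is a hypothetical helper, and the per-partition figure is something you would measure on your own cluster.

```python
import math

def partitions_needed(target_mb_s, per_partition_mb_s):
    """Target throughput divided by per-partition throughput, rounded up."""
    return math.ceil(target_mb_s / per_partition_mb_s)

print(partitions_needed(100, 10))  # the example from the text
print(partitions_needed(100, 8))   # a non-integer ratio rounds up
```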
Key-Based Ordering
Messages with the same key always go to the same partition (as long as the partition count does not change), which guarantees ordering per key. This is how you get ordered event streams per user or entity. But beware: hot keys (for example, one key carrying 80% of traffic) create partition hotspots.
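A small simulation of the hot-key problem. The tenant names and message counts are made up for illustration, and CRC32 again stands in for Kafka's murmur2 hash. One key carries 800 of 1000 messages, so whichever partition it hashes to carries at least 80% of the load.

```python
import zlib
from collections import Counter

def partition_load(key_counts, num_partitions):
    """Sum message counts per partition under hash-mod partitioning."""
    load = Counter()
    for key, count in key_counts.items():
        # Stand-in hash (CRC32); Kafka's Java client uses murmur2.
        load[zlib.crc32(key.encode("utf-8")) % num_partitions] += count
    return load

traffic = {"tenant-big": 800}                           # the hot key
traffic.update({f"tenant-{i}": 2 for i in range(100)})  # 200 messages over 100 keys

load = partition_load(traffic, 10)
hottest, hottest_count = load.most_common(1)[0]
share = hottest_count / sum(load.values())
print(f"partition {hottest} carries {share:.0%} of traffic")
```

Adding partitions does not help here, because a single key can never be split across partitions.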
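Offset Tracking
Each consumer tracks its position per partition via committed offsets (stored in the __consumer_offsets topic). On restart, consumers resume from their last commit. Auto-commit (the default) is convenient but commits periodically, decoupled from your processing logic. Manual commit after processing gives you control and at-least-once delivery; exactly-once additionally requires transactions or idempotent processing.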
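A toy model of commit-after-processing, with no Kafka client involved. The `consume` function and its `crash_after` parameter are hypothetical. It shows why a crash between processing and commit leads to a record being processed twice after restart.

```python
def consume(log, committed, crash_after=None):
    """Process records from the committed offset, committing after each one.

    Returns (processed_records, committed_offset). If crash_after is set, the
    consumer dies right after processing that offset, before committing it.
    """
    processed = []
    for offset in range(committed, len(log)):
        processed.append(log[offset])    # side effect happens first
        if offset == crash_after:
            return processed, committed  # crash: this commit never happened
        committed = offset + 1           # committed offset = next offset to read
    return processed, committed

log = ["a", "b", "c", "d"]
first, committed = consume(log, committed=0, crash_after=2)
second, committed = consume(log, committed)  # restart from the last commit

print(first, second, committed)
```

Record "c" appears in both runs: that is at-least-once delivery, which is why downstream processing should tolerate duplicates.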
Rebalancing
When consumers join or leave a group, Kafka rebalances by reassigning partitions. With the eager protocol, all consumption in the group pauses during the rebalance (stop-the-world). Use CooperativeStickyAssignor to minimize partition movement and reduce downtime.
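To illustrate why stickiness matters, the sketch below counts how many partitions change owner when a third consumer joins. Both functions are simplified stand-ins written for this example, not the algorithm inside Kafka's assignors: `round_robin` recomputes the assignment from scratch, while `sticky` keeps existing owners and moves only what balance requires.

```python
def round_robin(partitions, consumers):
    """Recompute ownership from scratch."""
    return {p: consumers[i % len(consumers)] for i, p in enumerate(partitions)}

def sticky(previous, consumers):
    """Keep existing owners where possible; move only what balance requires."""
    quota = -(-len(previous) // len(consumers))  # ceiling division
    owned = {c: [] for c in consumers}
    orphans = []
    for p, owner in previous.items():
        if owner in owned and len(owned[owner]) < quota:
            owned[owner].append(p)
        else:
            orphans.append(p)  # owner is gone or already at quota
    for p in orphans:
        target = min(consumers, key=lambda c: len(owned[c]))
        owned[target].append(p)
    return {p: c for c, parts in owned.items() for p in parts}

def moved(before, after):
    """Count partitions whose owner changed."""
    return sum(1 for p in before if before[p] != after[p])

partitions = list(range(6))
before = round_robin(partitions, ["c1", "c2"])
from_scratch = round_robin(partitions, ["c1", "c2", "c3"])
sticky_after = sticky(before, ["c1", "c2", "c3"])

print(moved(before, from_scratch), moved(before, sticky_after))
```

In this run the from-scratch assignment moves four partitions and the sticky one moves two; fewer moves mean fewer consumers that have to stop, hand over state, and resume.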
Operational Best Practices
Monitor Consumer Lag
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group mygroup
If lag grows steadily, you need more consumers or faster processing. Alert when lag exceeds your threshold (for example, N minutes behind).
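Lag per partition is the log end offset minus the committed offset. The sketch below computes it from two plain dictionaries and checks whether total lag is growing across samples. The function names and the sample numbers are hypothetical; in practice the offsets would come from the CLI output or a metrics exporter.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag = log end offset - committed offset."""
    return {p: end_offsets[p] - committed_offsets[p] for p in end_offsets}

def lag_is_growing(samples):
    """True if total lag strictly increased across every consecutive sample."""
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))

lag = consumer_lag({0: 1500, 1: 900}, {0: 1200, 1: 900})
print(lag, sum(lag.values()))
print(lag_is_growing([300, 450, 700]))  # steadily growing
print(lag_is_growing([300, 450, 200]))  # a spike that recovered
```

Alerting on the trend rather than a single reading avoids paging on short bursts that the group catches up on by itself.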
Never Decrease Partitions
# You can only ADD partitions, not remove them
kafka-topics.sh --bootstrap-server localhost:9092 --alter --topic mytopic --partitions 12
Adding partitions changes key routing: messages with the same key may land in a different partition than before. Plan ahead.
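A quick way to see the routing change: hash a set of keys against 6 partitions, then against 12, and count how many keys moved. The keys are synthetic, and CRC32 is a stand-in for Kafka's murmur2 hash, so the exact count will differ on a real cluster.

```python
import zlib

def partition_for(key, num_partitions):
    # Stand-in hash (CRC32); Kafka's Java client uses murmur2.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

keys = [f"user-{i}" for i in range(1000)]
moved = [k for k in keys if partition_for(k, 6) != partition_for(k, 12)]

print(f"{len(moved)} of {len(keys)} keys changed partition")
```

Any key that moves has its older messages in one partition and its newer ones in another, so per-key ordering across the change is no longer guaranteed.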
Retention & Compaction
# 7 days
retention.ms=604800000
# or delete
cleanup.policy=compact
Delete: remove old messages by time or size. Compact: keep the latest message per key. Use compact for changelog topics.
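The two cleanup policies, modeled on an in-memory list. This is a conceptual sketch with hypothetical function names: real brokers apply retention and compaction per log segment in the background, not per record on demand.

```python
def compact(log):
    """Keep only the latest record per key, preserving offset order."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    return sorted((off, key, val) for key, (off, val) in latest.items())

def delete_older_than(log, now_ms, retention_ms):
    """Drop (timestamp_ms, key, value) records older than the retention window."""
    return [r for r in log if now_ms - r[0] <= retention_ms]

log = [("u1", "a"), ("u2", "b"), ("u1", "c"), ("u3", "d"), ("u2", "e")]
print(compact(log))

timed = [(0, "u1", "a"), (1000, "u2", "b")]
print(delete_older_than(timed, now_ms=1500, retention_ms=1000))
```

Compaction keeps offsets 2, 3 and 4 here: one surviving record per key, which is exactly what a changelog consumer needs to rebuild current state.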
Producer Tuning
batch.size=65536
linger.ms=5
acks=all
A larger batch.size plus a small linger.ms improves throughput. Use acks=all for durability and acks=1 for lower latency. Never use acks=0 in production.
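A rough simulation of how batch size and linger time interact. It assumes a simplified rule (send the batch when the next message would overflow it, or when the linger time has passed since the batch was opened) and ignores everything else a real producer does. `batches` and the message stream are hypothetical.

```python
def batches(messages, batch_size, linger_ms):
    """Group (arrival_ms, size_bytes) messages the way a batching producer might.

    A batch is sent when adding the next message would exceed batch_size,
    or when linger_ms has passed since the batch's first message.
    """
    sent, current, current_bytes, opened_at = [], [], 0, None
    for arrival_ms, size in messages:
        if current and (current_bytes + size > batch_size
                        or arrival_ms - opened_at >= linger_ms):
            sent.append(current)
            current, current_bytes = [], 0
        if not current:
            opened_at = arrival_ms  # a new batch opens with this message
        current.append((arrival_ms, size))
        current_bytes += size
    if current:
        sent.append(current)
    return sent

msgs = [(t, 1000) for t in range(20)]  # one 1000-byte message per millisecond

print(len(batches(msgs, batch_size=65536, linger_ms=0)))  # no lingering
print(len(batches(msgs, batch_size=65536, linger_ms=5)))  # wait up to 5 ms
print(len(batches(msgs, batch_size=2500, linger_ms=100))) # size-bound batches
```

With these inputs, a 5 ms linger turns twenty single-message sends into four batches of five, at the cost of up to 5 ms of added latency per message.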
Under-Replicated Partitions
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
A non-zero URP count means data is at risk. Check broker health, disk I/O, and network. This is your #1 Kafka alert.
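What "under-replicated" means, expressed as a check: a partition is under-replicated when its in-sync replica set (ISR) is smaller than its assigned replica set. The dictionary layout and broker IDs below are made up for illustration.

```python
def under_replicated(partitions):
    """Return partitions whose ISR is smaller than the assigned replica set."""
    return [p["partition"] for p in partitions
            if len(p["isr"]) < len(p["replicas"])]

state = [
    {"partition": 0, "replicas": [1, 2, 3], "isr": [1, 2, 3]},  # healthy
    {"partition": 1, "replicas": [2, 3, 1], "isr": [2]},        # two followers behind
]
print(under_replicated(state))
```

Partition 1 has a single in-sync copy left, so one more broker failure there could lose acknowledged data.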
Rack-Aware Replication
broker.rack=us-east-1a
Ensure replicas are spread across racks/AZs. Losing one rack should not cost any partition all of its in-sync replicas.
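A sanity check you could run against a replica assignment: flag any partition whose replicas all sit in the same rack. The broker-to-rack mapping and the assignment are hypothetical sample data, and `single_rack_partitions` is an illustrative helper, not part of any Kafka tool.

```python
def single_rack_partitions(replica_assignment, broker_rack):
    """Return partitions whose replicas all sit in one rack."""
    return [p for p, brokers in replica_assignment.items()
            if len({broker_rack[b] for b in brokers}) < 2]

broker_rack = {1: "us-east-1a", 2: "us-east-1a", 3: "us-east-1b"}
assignment = {
    0: [1, 3],  # spans us-east-1a and us-east-1b
    1: [1, 2],  # both replicas in us-east-1a
}
print(single_rack_partitions(assignment, broker_rack))
```

Partition 1 would go fully offline if us-east-1a were lost, which is the situation rack-aware placement is meant to prevent.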