Message Queue

A message queue decouples producers from consumers in time, in load, and in failure domains. The choice of broker shapes everything downstream: delivery semantics (at-most-once vs at-least-once vs exactly-once), ordering guarantees (per-partition vs global), retention (delete on ack vs replayable log), and operational profile (Kafka's ZooKeeper-or-KRaft vs SQS's zero-ops vs NATS's embeddable simplicity). Picking the wrong broker is one of the most expensive long-lived mistakes in a system.

Architecture

Capacity Estimation

Metric	Value	Notes
Kafka throughput/broker	~1 M msg/s	tuned, 10 GbE
RabbitMQ throughput/node	~50 K msg/s	persistent, mirrored
SQS throughput	nearly unlimited (standard)	FIFO: 300 msg/s, ~3 K msg/s with batching
NATS throughput	~10 M msg/s	unpersisted core NATS
Retention typical	7 d (Kafka)	configurable to forever
Partition rebalance	seconds	tens of partitions
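The figures above can be turned into a rough sizing calculation. The following is a minimal sketch, not a benchmark-backed model: the function names, the 50% headroom default, and the replication factor of 3 are assumptions chosen for illustration, and the workload numbers in the usage lines are hypothetical.

```python
import math


def brokers_needed(peak_msgs_per_s: float, per_broker_msgs_per_s: float,
                   headroom: float = 0.5) -> int:
    """Brokers required to carry peak load while using only `headroom` of each broker."""
    usable = per_broker_msgs_per_s * headroom
    return math.ceil(peak_msgs_per_s / usable)


def retained_bytes(avg_msgs_per_s: int, avg_msg_bytes: int,
                   retention_days: int, replication_factor: int = 3) -> int:
    """Disk consumed by a replayable log: rate x size x retention x replicas."""
    seconds = retention_days * 86_400
    return avg_msgs_per_s * avg_msg_bytes * seconds * replication_factor


# Hypothetical workload: 2 M msg/s at peak, 100 K msg/s on average, 1 KB messages.
brokers = brokers_needed(2_000_000, 1_000_000)        # uses the ~1 M msg/s/broker figure
disk = retained_bytes(100_000, 1_000, retention_days=7)
print(brokers, disk / 1e12, "TB")
```

Under these assumptions, storage rather than throughput tends to dominate the bill: a week of retention at a modest average rate already lands in the hundreds of terabytes once replication is counted.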

Pub-Sub vs Queue

Two access patterns dominate:

Queue: each message is consumed by exactly one of N workers. The classic work-distribution pattern (RabbitMQ, SQS). On consumer failure, messages requeue.
Pub-sub: each message is delivered to all subscribers. Logical events broadcast to many consumers. Each subscriber tracks its own position.

Modern brokers blend both: Kafka exposes "consumer groups" in which members divide a topic's partitions among themselves (queue semantics), while separate groups each read the full topic independently (pub-sub semantics). One topic can therefore serve many use cases.
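The blend of queue and pub-sub semantics can be illustrated with a toy in-memory model. This is a sketch only: `Topic` and `ConsumerGroup` are hypothetical classes, the round-robin assignment is a simplification, and nothing here reflects Kafka's actual client API.

```python
from collections import defaultdict


class Topic:
    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, partition: int, message: str) -> None:
        self.partitions[partition].append(message)


class ConsumerGroup:
    def __init__(self, topic: Topic, members: list):
        self.topic = topic
        self.members = list(members)
        # Each group keeps its own read position per partition.
        self.offsets = [0] * len(topic.partitions)

    def assignment(self) -> dict:
        # Round-robin: every partition is owned by exactly one member of the group.
        owners = defaultdict(list)
        for p in range(len(self.topic.partitions)):
            owners[self.members[p % len(self.members)]].append(p)
        return dict(owners)

    def poll(self) -> dict:
        """Return {member: [messages]} for everything this group has not read yet."""
        delivered = {m: [] for m in self.members}
        for member, parts in self.assignment().items():
            for p in parts:
                log = self.topic.partitions[p]
                delivered[member].extend(log[self.offsets[p]:])
                self.offsets[p] = len(log)
        return delivered


topic = Topic(num_partitions=2)
topic.publish(0, "a")
topic.publish(1, "b")
topic.publish(0, "c")

billing = ConsumerGroup(topic, ["b1", "b2"])   # two workers split the partitions
audit = ConsumerGroup(topic, ["a1"])           # a second group sees everything again
print(billing.poll())
print(audit.poll())
```

Within `billing`, each message reaches one worker; across `billing` and `audit`, each message is read once per group.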

Broker-based vs Brokerless

Broker-based (Kafka, RabbitMQ, SQS): the producer writes to a central service, and durability and routing decisions live there. Easy to reason about and easy to monitor, but the broker is a single point of failure (SPOF) unless you replicate it.
Brokerless (ZeroMQ): producer and consumer connect directly, and the library handles framing and routing. Lower latency and no central scaling concern, but you lose persistence and centralized observability. Core NATS offers a similar trade-off (no persistence, very low latency), although its clients still connect through a lightweight nats-server.

Brokerless is fine for in-process or per-service IPC; for cross-system event propagation, broker-based is the default. Choose brokerless when sub-millisecond latency matters and durability does not.

Ordering Guarantees

Per-partition ordering (Kafka, Pulsar) — messages within one partition are ordered; across partitions, no guarantee. Producers route by key (e.g., user_id) so all events for one user land on one partition.
Global ordering — only achievable by funneling everything to one partition / one broker, which caps throughput. Rarely worth it; partition-level is enough for almost every use case.
FIFO guaranteed at delivery (SQS FIFO) — broker enforces order per group ID, but throughput is capped at 300 msg/s without batching.
Unordered (Standard SQS, NATS core) — lowest overhead, fine for idempotent or independent events.
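Key-based routing, which is what makes per-partition ordering useful, can be sketched in a few lines. The helper names are hypothetical, and CRC32 is used here only as a stable stand-in hash; Kafka's default partitioner uses murmur2 instead.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # CRC32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, num_partitions: int):
    """Append each (key, payload) event to the partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, payload in events:
        partitions[partition_for(key, num_partitions)].append((key, payload))
    return partitions


events = [("u1", "login"), ("u2", "login"), ("u1", "add_to_cart"),
          ("u1", "pay"), ("u2", "logout")]
partitions = route(events, num_partitions=4)

# All of u1's events sit in one partition, in the order they were produced.
u1_partition = partitions[partition_for("u1", 4)]
print([payload for key, payload in u1_partition if key == "u1"])
```

Note that changing `num_partitions` changes the mapping, so events for one key may move to a different partition after a repartition.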

At-least-once vs Exactly-once

The hard truth: end-to-end exactly-once across producer, broker, and consumer requires cooperation from all three.

At-most-once: send once, never retry. Loses messages on failure. Acceptable only for sampled metrics.
At-least-once: send and retry until ack. Duplicates possible; consumer must dedupe. Default for most production use.
Exactly-once: requires idempotent producers (Kafka's idempotent producer tags each record with its producer ID and a per-partition sequence number, so the broker can discard retried duplicates) and transactional consumers (offsets committed atomically with output writes).
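The sequence-number idea behind idempotent producers can be sketched as a broker-side check on a single partition. This is a simplified illustration with a hypothetical `PartitionLog` class; a real broker tracks more state (batches, epochs) than shown here.

```python
class PartitionLog:
    """Toy single-partition log that deduplicates producer retries by sequence number."""

    def __init__(self):
        self.records = []
        self.last_seq = {}   # producer_id -> last sequence number appended

    def append(self, producer_id: int, seq: int, value: str) -> str:
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return "duplicate"       # retry of a record already written: drop it
        if seq != last + 1:
            return "out_of_order"    # gap in the sequence: reject rather than write
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return "appended"


log = PartitionLog()
print(log.append(7, 0, "a"))   # first write
print(log.append(7, 0, "a"))   # network retry of the same record
print(log.records)
```

The retry is acknowledged without being written twice, which is what lets the producer retry freely without creating duplicates in the log.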

Kafka's "exactly-once semantics" is real but bounded: it covers Kafka-to-Kafka pipelines (read from topic A, transform, write to topic B). Once you write to an external system (Postgres, S3, an HTTP API), you are back to at-least-once unless that system has its own idempotency story.

Practical pattern: use at-least-once delivery + idempotent consumers. The dedup key flows with the message; the consumer checks a "seen" store before applying side effects.
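A minimal sketch of the idempotent-consumer pattern follows. The message shape (`dedup_key`, `payload`) and the class name are assumptions for illustration, and the in-memory set stands in for a durable "seen" store such as a database table with a unique constraint.

```python
class IdempotentConsumer:
    def __init__(self, apply_side_effect):
        self.seen = set()   # stand-in for a durable store keyed by dedup key
        self.apply_side_effect = apply_side_effect

    def handle(self, message: dict) -> bool:
        """Apply the side effect once per dedup key; return False for duplicates."""
        key = message["dedup_key"]
        if key in self.seen:
            return False
        self.apply_side_effect(message["payload"])
        self.seen.add(key)
        return True


charges = []
consumer = IdempotentConsumer(charges.append)
msg = {"dedup_key": "order-42", "payload": 1999}
consumer.handle(msg)
consumer.handle(msg)    # redelivery after a lost ack
print(charges)
```

In this sketch the key is recorded after the side effect, so a crash between the two steps could still apply the effect twice. A production version would typically record the key and the side effect in the same transaction.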

Dead Letter Queues

A DLQ catches messages that failed processing repeatedly. Best practice:

Fail-fast on poison messages — track per-message attempt count; after N (typically 5–10), route to DLQ instead of redelivering forever.
Preserve context — original payload, last error, attempt history.
Alert on rate, not absolute count — DLQ growing fast = systemic issue; slow trickle = expected long-tail.
Replay tooling — ops can re-drive a slice of DLQ after fixing the bug. Without replay, the DLQ is a graveyard, not an alert.
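The practices above can be sketched with an in-memory queue. The function names, message fields, and the attempt limit used below are illustrative assumptions, not any particular broker's API.

```python
from collections import deque


def drain(queue: deque, handler, dlq: list, max_attempts: int = 5) -> None:
    """Process every message; after `max_attempts` failures, park it in the DLQ."""
    while queue:
        msg = queue.popleft()
        try:
            handler(msg["payload"])
        except Exception as exc:
            msg["attempts"] = msg.get("attempts", 0) + 1
            msg.setdefault("errors", []).append(repr(exc))   # attempt history
            if msg["attempts"] >= max_attempts:
                dlq.append(msg)       # keeps payload plus error context
            else:
                queue.append(msg)     # redeliver later


def replay(dlq: list, queue: deque, predicate) -> int:
    """Move DLQ entries matching `predicate` back to the queue with a fresh attempt count."""
    selected = [m for m in dlq if predicate(m)]
    for m in selected:
        dlq.remove(m)
        m["attempts"] = 0
        queue.append(m)
    return len(selected)
```

A typical flow is to call `drain`, fix the bug that produced the DLQ entries, then call `replay` with a predicate that selects only the affected slice (for example, messages whose last error mentions the fixed exception).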

The Brokers Compared

Kafka: partitioned log, durable retention, high throughput. Default for event streaming, CDC, log aggregation. Operational weight: ZooKeeper (legacy) or KRaft (new). Ecosystem: Connect, Streams, ksqlDB. See Kafka internals.
RabbitMQ: AMQP, classic queue semantics, rich routing (topic exchanges, headers, fanout). Best for task queues with complex routing. Lower throughput than Kafka; quorum queues for HA (classic mirrored queues are deprecated in their favor).
Redis Streams: Kafka-light embedded in Redis. Good for < 100 K msg/s. Data lives in memory, so durability is only as strong as your Redis RDB/AOF persistence settings. Compelling when you already run Redis.
NATS / NATS JetStream: core NATS is a lightweight, non-persistent pub-sub server; JetStream adds durability and Kafka-like log semantics. Small single binary, easy to embed; growing in IoT / edge.
SQS: managed, zero ops. Standard (best-effort order, near-infinite throughput) and FIFO (strict order, capped). Use when you do not want to run a broker.
Apache Pulsar: tiered architecture (compute brokers + BookKeeper bookies for storage), multi-tenancy primitives, built-in geo-replication. Operationally heavy but powerful for multi-tenant SaaS.

Failure Modes

Consumer lag runaway: producers outpace consumers, lag grows until it exceeds the retention window, and unread messages are deleted. Auto-scale consumers on lag; alert when lag reaches 5 minutes.
Hot partition: one key dominates, so its traffic lands on a single partition and broker. Repartition, or spread the hot key with a compound key (key plus sub-key), accepting that ordering then holds only per compound key.
Rebalance storm: consumers joining or leaving trigger a group rebalance, and processing pauses for seconds. Use static membership (group.instance.id, Kafka 2.3+) and cooperative rebalancing (Kafka 2.4+) to bound rebalance cost.
Schema break: a producer publishes a new schema and consumers crash while parsing it. Use a schema registry with backward-compatibility enforcement.
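For the lag-runaway case, a rough estimate of how long you have before retention starts deleting unread data can guide the alert threshold. This is a sketch under a steady-rate assumption; the function names and the numbers in the usage lines are hypothetical.

```python
import math


def lag_seconds(lag_msgs: int, produce_rate: float) -> float:
    """Approximate age of the oldest unread message, assuming a steady produce rate."""
    return lag_msgs / produce_rate


def seconds_until_retention_loss(lag_msgs: int, produce_rate: float,
                                 consume_rate: float, retention_s: int) -> float:
    """Time until the oldest unread message ages past the retention window."""
    if consume_rate >= produce_rate:
        return math.inf                       # lag is flat or shrinking
    max_lag_msgs = produce_rate * retention_s  # lag at which loss begins
    return max(0.0, (max_lag_msgs - lag_msgs) / (produce_rate - consume_rate))


# Hypothetical: 300 K messages behind, producing 1000 msg/s, consuming 800 msg/s,
# with 7 days of retention.
print(lag_seconds(300_000, 1000))
print(seconds_until_retention_loss(300_000, 1000, 800, 7 * 86_400))
```

With short retention windows or a large rate gap, the second number shrinks quickly, which is the argument for alerting on lag measured in time rather than in message count.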

FAQ

Kafka or SQS for a new system?

SQS for simple work queues with manageable volume and no replay needs. Kafka for event streaming, CDC, log aggregation, or anywhere you might want to replay history.

How many partitions should I create?

Start with the consumer parallelism you expect at peak (one consumer per partition is the upper bound). Over-partitioning wastes broker resources; under-partitioning caps throughput. 10–100 is a good range to start.
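As a back-of-the-envelope version of this rule, the partition count can be derived from target throughput and per-consumer throughput. The function and its `growth_factor` parameter are assumptions for illustration; measure per-consumer throughput on your own workload before relying on the result.

```python
import math


def partitions_for(target_msgs_per_s: float, per_consumer_msgs_per_s: float,
                   growth_factor: float = 1.0) -> int:
    """Partitions needed so that one consumer per partition can keep up at peak."""
    consumers_at_peak = target_msgs_per_s / per_consumer_msgs_per_s
    return math.ceil(consumers_at_peak * growth_factor)


# Hypothetical: 50 K msg/s at peak, each consumer handles about 2 K msg/s,
# with 2x headroom for growth.
print(partitions_for(50_000, 2_000, growth_factor=2.0))
```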

What about message size?

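Keep payloads small (< 1 MB); Kafka's default broker limit (message.max.bytes) is about 1 MB. Larger blobs go in object storage such as S3, with the message carrying only the object key. Kafka can be tuned to accept multi-MB messages but handles them poorly, and every broker prefers small messages.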
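Storing the blob elsewhere and sending only its key is often called the claim-check pattern. Below is a minimal sketch: `BlobStore` is a hypothetical in-memory stand-in for S3, and the message fields and size threshold are assumptions.

```python
import uuid

MAX_INLINE_BYTES = 1_000_000   # assumed threshold, mirroring the < 1 MB guidance


class BlobStore:
    """In-memory stand-in for an object store such as S3."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        key = str(uuid.uuid4())
        self._objects[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


def to_message(payload: bytes, store: BlobStore) -> dict:
    """Inline small payloads; park large ones in the store and send only the key."""
    if len(payload) <= MAX_INLINE_BYTES:
        return {"inline": payload}
    return {"blob_key": store.put(payload), "size": len(payload)}


def from_message(message: dict, store: BlobStore) -> bytes:
    if "inline" in message:
        return message["inline"]
    return store.get(message["blob_key"])
```

The consumer calls `from_message` and never needs to know which path the producer took, so the threshold can change without a consumer release.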

Do I need a schema registry?

Once you have > 2 services on a topic, yes. Without it, producers will break consumers and you will discover it only at runtime. Confluent Schema Registry, Apicurio, and AWS Glue Schema Registry are options.
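To make "backward-compatible" concrete, here is a deliberately simplified check of the kind a registry performs before accepting a new schema version. The schema representation (a dict of field name to `type` and optional `default`) is an assumption for illustration; real registries apply richer rules, such as type promotion.

```python
def is_backward_compatible(old: dict, new: dict) -> bool:
    """True if a reader using `new` can still decode data written with `old`."""
    for name, spec in new.items():
        if name not in old:
            # A field the old writer never sent must have a default to fall back on.
            if "default" not in spec:
                return False
        elif old[name]["type"] != spec["type"]:
            # Simplification: any type change on a shared field is rejected.
            return False
    return True   # fields removed in `new` are simply ignored by the reader


v1 = {"id": {"type": "string"}, "amount": {"type": "long"}}
v2_ok = {**v1, "currency": {"type": "string", "default": "USD"}}
v2_bad = {**v1, "currency": {"type": "string"}}

print(is_backward_compatible(v1, v2_ok))
print(is_backward_compatible(v1, v2_bad))
```

Running a check like this in CI, before a producer deploys, moves the failure from consumer runtime to build time.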