Kafka Streams vs Flink

Two Stream-Processing Models, Two Operational Models

Kafka Streams and Apache Flink solve the same problem — stream processing — with very different architectures. Kafka Streams is a Java library you embed in your application; Kafka itself coordinates the work. Flink is a distributed system with a JobManager and TaskManagers, optimized for low-latency stateful computations at scale.

The choice usually comes down to two questions: how much latency can you tolerate, and how much operational complexity do you want? Streams trades latency and feature depth for operational simplicity. Flink trades operational footprint for feature depth and performance.

Architecture Comparison

Kafka Streams: each app instance embeds the Streams library with a local RocksDB store; the Kafka cluster itself coordinates instances through the consumer-group protocol, and changelog topics back the state.

  • + Embedded in your service
  • + No separate cluster to operate
  • + Scale via consumer-group rebalance
  • - Kafka source/sink only
  • - Higher latency (transaction-based)

Apache Flink: a JobManager (with ZooKeeper for HA) schedules work onto TaskManagers, each offering task slots; checkpoints go to a remote store (S3 / HDFS / GCS).

  • + Kafka, Kinesis, Pulsar, JDBC, files, sockets
  • + True event-time + watermarks
  • + Async snapshots, low latency
  • - Cluster to operate
  • - Higher learning curve

Key Numbers

  • 200 ms: Kafka Streams typical end-to-end latency
  • 5-50 ms: Flink typical end-to-end latency
  • 100s of GB: realistic state per Kafka Streams instance
  • 10s of TB: realistic state per Flink operator
  • 100 ms: commit.interval.ms in Streams (the default when exactly-once is enabled; 30 s otherwise)
  • 1-5 min: typical Flink checkpoint interval
  • 1 process: Streams ops footprint (just your app)

Kafka Streams: Library Code

// build.gradle: just add the dependency
// implementation 'org.apache.kafka:kafka-streams:3.7.0'

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-processor");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092");
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, "exactly_once_v2");

StreamsBuilder builder = new StreamsBuilder();
KStream<String,Order> orders = builder.stream("orders");

KTable<String,Long> revenuePerCustomer = orders
  .groupByKey()
  .aggregate(
    () -> 0L,
    (key, order, total) -> total + order.amount,
    Materialized.<String,Long,KeyValueStore<Bytes,byte[]>>as("revenue-store")
      .withKeySerde(Serdes.String())
      .withValueSerde(Serdes.Long())
  );

revenuePerCustomer.toStream().to("revenue", Produced.with(Serdes.String(), Serdes.Long()));

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
// That's it. The topology runs in this JVM.
// Run multiple instances; Kafka rebalances partitions across them.

Flink: Distributed Job

// build.gradle: flink-streaming-java + flink-connector-kafka
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000);  // 60s checkpoint interval
env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/flink-checkpoints");
env.setStateBackend(new EmbeddedRocksDBStateBackend(true));  // incremental

KafkaSource<Order> source = KafkaSource.<Order>builder()
  .setBootstrapServers("kafka-1:9092")
  .setTopics("orders")
  .setGroupId("flink-orders-processor")
  .setDeserializer(new OrderDeserializer())
  .setStartingOffsets(OffsetsInitializer.committedOffsets())
  .build();

DataStream<Order> orders = env.fromSource(source,
    WatermarkStrategy.<Order>forBoundedOutOfOrderness(Duration.ofSeconds(30))
      .withTimestampAssigner((o, ts) -> o.eventTime),
    "kafka-orders");

orders
  .keyBy(o -> o.customerId)
  .window(TumblingEventTimeWindows.of(Time.minutes(5)))
  .aggregate(new RevenueAggregator())
  .sinkTo(KafkaSink.<Long>builder()
    .setBootstrapServers("kafka-1:9092")
    .setRecordSerializer(...)
    .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
    .setTransactionalIdPrefix("flink-revenue-")
    .build());

env.execute("orders-processor");
// Submit this jar to a Flink cluster:
// $ flink run -m yarn-cluster orders-processor.jar
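The pipeline above references a RevenueAggregator that is not shown. A minimal sketch of its logic follows; in a real job the class would declare `implements AggregateFunction<Order, Long, Long>` (the Order type and its fields here are assumptions carried over from the example, and the interface is omitted so the logic stands alone):

```java
// Sketch of the aggregation logic behind RevenueAggregator. In a real
// Flink job this class would implement
// org.apache.flink.api.common.functions.AggregateFunction<Order, Long, Long>.
class RevenueAggregator {
    // Minimal stand-in for the Order type used in the pipeline (assumption).
    static class Order {
        final String customerId;
        final long amount;
        Order(String customerId, long amount) {
            this.customerId = customerId;
            this.amount = amount;
        }
    }

    Long createAccumulator() { return 0L; }                     // initial sum per window
    Long add(Order value, Long acc) { return acc + value.amount; }  // fold one record in
    Long getResult(Long acc) { return acc; }                    // emit when the window fires
    Long merge(Long a, Long b) { return a + b; }                // needed for session merges
}
```

Flink calls `add` once per record and keeps only the accumulator in state, so window state stays small even for high-volume keys.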

Exactly-Once: Two Approaches

  • Mechanism: Streams uses Kafka transactions per input batch; Flink uses asynchronous distributed snapshots (a Chandy-Lamport variant).
  • Commit interval: Streams commits every commit.interval.ms (100 ms by default under exactly-once); Flink checkpoints typically every 1-5 min.
  • Latency floor: in Streams, records become visible only after the transaction commits, so commit.interval.ms is the floor; Flink emits records continuously, since the snapshot is asynchronous.
  • Source/sink scope: Streams is Kafka-only; Flink supports any source plus two-phase-commit-capable sinks (Kafka, JDBC, files, ...).
  • State recovery: Streams replays the changelog topic into RocksDB; Flink restores a checkpoint from S3/HDFS.
  • Recovery time: Streams recovery is proportional to state size (full changelog replay on a fresh instance); Flink restore is bounded by checkpoint download, because the state is already snapshotted.
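Because the Streams latency floor is the commit interval, latency-sensitive exactly-once applications tune commit.interval.ms directly. A sketch of that configuration; the 50 ms value is illustrative, and string keys are used here so the snippet stands alone without StreamsConfig:

```java
import java.util.Properties;

class LatencyTuning {
    // Streams config with a tighter commit interval (value illustrative).
    // Under exactly_once_v2, commit.interval.ms defaults to 100 ms; lowering
    // it shrinks the end-to-end latency floor at the cost of more, smaller
    // transactions and more broker-side transaction overhead.
    static Properties lowLatencyConfig() {
        Properties props = new Properties();
        props.put("processing.guarantee", "exactly_once_v2");
        props.put("commit.interval.ms", "50");
        return props;
    }
}
```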

State Stores: RocksDB vs Flink Backends

# Kafka Streams: state always lives in local RocksDB
# State directory per instance:
/tmp/kafka-streams/orders-processor/0_5/rocksdb/revenue-store/

# Each store is backed by a Kafka changelog topic (auto-created):
orders-processor-revenue-store-changelog

# Restart sequence:
# 1. Read local RocksDB checkpoint marker (if any)
# 2. Replay changelog from that offset to rebuild state
# 3. Resume processing
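The replay step amounts to folding the changelog into a key-value store, last write per key wins. A simplified sketch; the real restore writes into RocksDB, not a HashMap:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch of changelog replay: each changelog record is a
// (key, value) upsert, and a null value is a tombstone (deletion).
// Replaying from the checkpointed offset and keeping the last value per
// key reconstructs the store.
class ChangelogReplay {
    static Map<String, Long> restore(List<? extends Map.Entry<String, Long>> changelog) {
        Map<String, Long> store = new LinkedHashMap<>();
        for (Map.Entry<String, Long> record : changelog) {
            if (record.getValue() == null) {
                store.remove(record.getKey());                 // tombstone: delete
            } else {
                store.put(record.getKey(), record.getValue()); // upsert
            }
        }
        return store;
    }
}
```

This is also why recovery time grows with state size: a fresh instance must read the whole (compacted) changelog before it can serve.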

# Flink: pluggable state backends
# 1. HashMapStateBackend  — heap, low-latency, limited size
# 2. EmbeddedRocksDBStateBackend  — local RocksDB + remote checkpoints
# 3. ChangelogStateBackend  — incremental + low-latency restore (Flink 1.16+)

# Flink incremental checkpoints only ship RocksDB SST file diffs,
# making 1 TB state checkpointable in seconds.
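The backend choice can also be set cluster-wide in flink-conf.yaml instead of in code; a sketch, with illustrative paths and values:

```yaml
# flink-conf.yaml: cluster-wide defaults (bucket and interval illustrative)
state.backend: rocksdb                          # EmbeddedRocksDBStateBackend
state.backend.incremental: true                 # ship only SST file diffs
state.checkpoints.dir: s3://my-bucket/flink-checkpoints
execution.checkpointing.interval: 60s
```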

Event-Time and Watermarks

  • Event-time windowing: yes in both; basic in Streams, rich and well-understood in Flink.
  • Watermarks: Streams has no explicit watermarks (stream time advances per partition with observed timestamps); Flink watermarks are pluggable, with idleness detection.
  • Allowed lateness: Streams offers a grace period and drops records after it; Flink lateness is configurable, with side outputs for late events.
  • Session windows: fixed-gap only in Streams; Flink supports a dynamic gap function.
  • Window types: Streams has tumbling, hopping, sliding, and session windows; Flink adds custom window assigners and triggers.
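Both systems assign a record to a tumbling window the same way: align its event timestamp down to a multiple of the window size. A sketch of that arithmetic, using the 5-minute size from the Flink example:

```java
// How tumbling event-time windows bucket records: the window start is the
// record's event timestamp rounded down to a multiple of the window size.
// Grace period (Streams) or allowed lateness (Flink) then decides how long
// past window end a late record is still accepted.
class TumblingWindow {
    static long windowStart(long eventTimeMs, long sizeMs) {
        return eventTimeMs - (eventTimeMs % sizeMs);
    }
    static long windowEnd(long eventTimeMs, long sizeMs) {
        return windowStart(eventTimeMs, sizeMs) + sizeMs;
    }
}
```

Because the assignment depends only on the record's own timestamp, any instance can compute it independently; no coordination is needed to agree on window boundaries.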

Tradeoffs and When to Choose Which

Choose Kafka Streams when

  • Source and sink are both Kafka
  • You already have a JVM service to embed in
  • State fits on one instance's local disk (~hundreds of GB)
  • Sub-second latency is enough
  • You don't want a separate cluster to operate

Choose Flink when

  • Sources/sinks beyond Kafka (Kinesis, Pulsar, JDBC, files)
  • State is multi-TB or grows unboundedly
  • Sub-100ms latency required
  • Complex event-time semantics (sessions, late data)
  • Iterative algorithms (graph processing, ML)

Frequently Asked Questions

Why is Kafka Streams a library and not a cluster?

Kafka Streams was designed to embed in your existing service. You write a Java application that calls KafkaStreams.start() and the topology runs in that process. Multiple instances form a logical processing group via Kafka's consumer group protocol — Kafka itself coordinates partition assignment. This means no separate cluster to operate, no resource manager, no job submission. The price is that you only get Kafka-level integration: state stores live in RocksDB on local disk, and exactly-once relies on Kafka transactions.
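The coordination described above rests on deterministic key-to-partition mapping: a key always hashes to the same partition, and the group assigns each partition to exactly one instance, so that instance also owns the RocksDB state for all keys in its partitions. A simplified sketch; Kafka's real default partitioner hashes the serialized key with murmur2, and real assignment comes from the rebalance protocol, so both functions here are stand-ins:

```java
// Simplified view of how a consumer group spreads keyed work.
class KeyRouting {
    // Stand-in for Kafka's partitioner (which actually uses murmur2 on the
    // serialized key); masking the sign bit keeps the result non-negative.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    // Stand-in for the rebalance protocol's partition-to-instance assignment.
    static int instanceFor(int partition, int numInstances) {
        return partition % numInstances;
    }
}
```

Adding an instance triggers a rebalance: some partitions (and their changelog-backed state) migrate to the newcomer, which is the whole scaling story for Streams.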

How does Flink achieve exactly-once across non-Kafka sinks?

Flink uses asynchronous barrier snapshotting, a variant of the Chandy-Lamport algorithm. Periodically, the JobManager injects a checkpoint barrier into every input stream. Each operator snapshots its state when the barrier arrives; with the default aligned checkpoints, barrier alignment makes recording in-flight buffers unnecessary (unaligned checkpoints store them instead). When all operators have snapshotted, the checkpoint is durable in remote storage (S3, HDFS). On failure, all operators roll back to the last completed checkpoint. For sinks that support it (Kafka, JDBC with two-phase commit, file sinks), checkpoint completion coordinates sink commits to give exactly-once end-to-end.

Can Kafka Streams do event-time windows?

Yes, but with limitations. Kafka Streams supports event-time semantics via custom timestamp extractors, but it has no explicit watermark mechanism: stream time simply advances with the timestamps it observes, and a per-window grace period bounds how late a record may arrive. Late events past the grace period are dropped. Flink's watermark mechanism is more sophisticated, supports allowed lateness with side outputs for late events, and has richer windowing primitives (session windows with arbitrary gap functions, sliding windows of arbitrary granularity).

Where does state live in each system?

Kafka Streams: RocksDB on each instance's local disk. State is partitioned by the same key as the input topic, and changes are continuously written to a Kafka changelog topic for fault tolerance. On restart, RocksDB is rebuilt by replaying the changelog. Flink: state is in the operator's heap or RocksDB (configurable per state backend), checkpointed asynchronously to remote storage. Flink can have very large state (terabytes per operator) because the state backend supports incremental checkpoints.

Which has lower latency?

Flink, by an order of magnitude for stateful jobs. Kafka Streams commits Kafka transactions on a timer (commit.interval.ms, 100 ms by default under exactly-once), so end-to-end latency is bounded below by the commit interval plus consumer fetch latency, typically 200-500 ms. Flink checkpoints asynchronously, so per-record latency is single-digit to low tens of milliseconds. For latency-critical applications (sub-100 ms), Flink is the standard choice. When seconds are fine, Kafka Streams is much simpler to operate.