Spark vs Flink

Batch-First vs Streaming-First, and the Tradeoffs That Follow

Spark and Flink are both top-tier distributed processing engines, both written in Scala/Java, both running on the JVM, both supporting batch and streaming. They look interchangeable from a brochure. They are not — the underlying execution model is fundamentally different, and that difference cascades into latency, state management, exactly-once semantics, and ecosystem fit.

Spark is batch-first: an RDD/DataFrame engine where streaming is incremental batches. Flink is streaming-first: a continuous dataflow engine where batch is bounded streams. For analytical SQL on a data lake, Spark wins on tooling and optimizer. For low-latency stateful streaming, Flink wins on pretty much every dimension.

Architecture Comparison

Spark (batch-first):
  • Driver with DAG scheduler and task scheduler, dispatching to executors
  • RDD/DataFrame -> stages -> tasks; streaming = micro-batches
  • Catalyst + Tungsten + AQE
  • Strengths: batch SQL ecosystem (Delta, Iceberg), mature optimizer + ML libraries, largest community

Flink (streaming-first):
  • JobManager with checkpoint coordinator and resource manager, dispatching to TaskManagers
  • Continuous dataflow operators; async distributed snapshots
  • RocksDB state, incremental checkpoints
  • Strengths: low-latency event-time stream processing, multi-TB state with fast checkpoints, true exactly-once across many connectors

Key Numbers

  • 100ms+: Spark Streaming typical latency
  • 5-50ms: Flink typical latency
  • 2014: Spark 1.0 release
  • 2014: Flink 0.5 release (originally Stratosphere)
  • Catalyst: Spark optimizer (mature, ~150 rules)
  • Calcite: Flink SQL optimizer (Apache Calcite)
  • 10s of TB: Flink production state size

Streaming Models

Aspect        | Spark Structured Streaming                         | Flink DataStream
Execution     | Micro-batch (continuous trigger experimental)      | True per-record streaming
Latency       | ~100 ms to seconds                                 | Milliseconds (single-digit typical)
Backpressure  | Implicit via batch timing                          | Explicit, credit-based, well-instrumented
Event time    | Watermark per source partition                     | Watermark per parallel subtask, idleness-aware
Out-of-order  | Watermark-based lateness; older events dropped     | allowedLateness plus side outputs for late events
Window types  | Tumbling, sliding, session (fixed gap)             | Same, plus session with custom gap and custom assigners
State APIs    | flatMapGroupsWithState (typed but limited)         | Rich KeyedState API: ValueState, ListState, MapState
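The event-time row is worth making concrete. The sketch below is a minimal plain-Python simulation (not either engine's API; `process` and `window_start` are illustrative names): a tumbling-window assigner plus a bounded-out-of-orderness watermark that either drops a late record, roughly Spark's behavior, or routes it to a side output, roughly Flink's. Window firing and retention are deliberately omitted.

```python
from collections import defaultdict

WINDOW_MS = 10_000  # tumbling window size

def window_start(ts):
    """Assign an event timestamp to its tumbling window."""
    return ts - (ts % WINDOW_MS)

def process(events, max_out_of_orderness=2_000, use_side_output=False):
    """Simulate watermark handling: drop late events (Spark-style)
    or divert them to a side output (Flink-style)."""
    windows = defaultdict(int)
    side_output = []
    watermark = float("-inf")
    for ts, value in events:
        watermark = max(watermark, ts - max_out_of_orderness)
        if ts < watermark:                 # record is late
            if use_side_output:
                side_output.append((ts, value))
            continue                       # dropped otherwise
        windows[window_start(ts)] += value
    return dict(windows), side_output

events = [(1_000, 1), (12_000, 1), (2_000, 1)]     # last event is late
print(process(events))                             # late event silently dropped
print(process(events, use_side_output=True))       # late event captured for reprocessing
```

The difference matters operationally: with side outputs you can audit or re-ingest late data instead of losing it.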

Exactly-Once Approaches

# Spark: micro-batch atomicity
# Each micro-batch:
# 1. Read offsets from source (checkpoint)
# 2. Process the batch
# 3. Write outputs (transactional sinks for exactly-once)
# 4. Commit offsets to checkpoint
# Crash mid-batch: replay from last checkpoint, idempotent reprocess
# Limitation: latency floor = batch interval
# Sink support: Kafka, Delta, Iceberg, file sink, foreachBatch
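The four steps above can be sketched as a plain-Python simulation (no Spark APIs; `Checkpoint`, `run_micro_batches`, and the dict sink are illustrative stand-ins). The key property: offsets are committed only after the sink write, so a crash mid-batch replays from the last committed offset and an idempotent sink absorbs the duplicates.

```python
class Checkpoint:
    """Durable record of the last committed source offset."""
    def __init__(self):
        self.offset = 0

def run_micro_batches(source, checkpoint, sink, batch_size, crash_at_batch=None):
    """Loop of the micro-batch protocol: read from the committed offset,
    process, write idempotently, then commit the new offset."""
    batch_no = 0
    while checkpoint.offset < len(source):
        start = checkpoint.offset                        # 1. read offsets
        batch = source[start:start + batch_size]
        results = [(start + i, x * 2) for i, x in enumerate(batch)]  # 2. process
        for key, value in results:
            sink[key] = value                            # 3. idempotent keyed write
        if crash_at_batch == batch_no:
            return                                       # crash before commit
        checkpoint.offset = start + len(batch)           # 4. commit offsets
        batch_no += 1

source = [1, 2, 3, 4]
ckpt, sink = Checkpoint(), {}
run_micro_batches(source, ckpt, sink, batch_size=2, crash_at_batch=1)  # crash mid-run
run_micro_batches(source, ckpt, sink, batch_size=2)                    # restart: batch replayed
assert sorted(sink.values()) == [2, 4, 6, 8]   # no duplicates, no loss
```

Replace the dict with a non-idempotent, non-transactional sink and the same crash produces duplicates, which is why exactly-once sink support lists transactional or idempotent targets.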

# Flink: async distributed snapshots
# Periodically:
# 1. JobManager injects checkpoint barrier into every input
# 2. Each operator snapshots its state when barrier passes
# 3. Sinks pre-commit (2PC); commit on checkpoint complete
# 4. JobManager records the global checkpoint as durable
# Crash: restore all operators from latest checkpoint, sinks roll back uncommitted
# Latency floor: per-record (snapshots are async)
# Sink support: Kafka, JDBC w/ XA, files, any 2PC sink
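The barrier mechanism can likewise be sketched as a toy single-operator simulation (illustrative names, not Flink's API; real Flink aligns barriers across multiple inputs and snapshots asynchronously): state is frozen exactly at the barrier, so restoring a checkpoint discards any post-barrier effects and the source replays those records.

```python
import copy

class Operator:
    """A stateful operator that counts records per key."""
    def __init__(self):
        self.state = {}
        self.snapshots = {}          # checkpoint_id -> frozen state

    def process(self, key):
        self.state[key] = self.state.get(key, 0) + 1

    def snapshot(self, checkpoint_id):
        # 2. snapshot state at the moment the barrier passes
        self.snapshots[checkpoint_id] = copy.deepcopy(self.state)

    def restore(self, checkpoint_id):
        self.state = copy.deepcopy(self.snapshots[checkpoint_id])

op = Operator()
stream = ["a", "b", ("BARRIER", 1), "a", "c"]   # 1. barrier injected into the input
for record in stream:
    if isinstance(record, tuple) and record[0] == "BARRIER":
        op.snapshot(record[1])
    else:
        op.process(record)

# Crash after processing "a" and "c": restore from checkpoint 1.
op.restore(1)
print(op.state)   # pre-barrier state only; the source replays "a" and "c"
```

Because the snapshot rides along with the data, no global pause is needed, which is why Flink's latency floor stays per-record.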

Batch Workloads

Aspect                   | Spark                                                | Flink
SQL optimizer            | Catalyst (~150 rules) + AQE at runtime               | Apache Calcite-based, fewer specialized rules
Code generation          | Tungsten whole-stage codegen                         | Some codegen, less aggressive than Tungsten
Lakehouse ecosystem      | Delta Lake, Iceberg, Hudi all native                 | Iceberg connector solid; Delta/Hudi via third parties
ML                       | MLlib (RDD API in maintenance) + Spark ML pipelines  | FlinkML in maintenance mode
Adaptive Query Execution | On by default: dynamic skew handling, partition coalescing, join switching | Less mature runtime adaptation
Cloud support            | EMR, Dataproc, Databricks, Synapse, Glue             | Kinesis Data Analytics (now Amazon Managed Service for Apache Flink), Ververica Cloud; fewer native options

State Management Details

# Spark Structured Streaming state stores
spark.sql.streaming.stateStore.providerClass =
  org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider
# State at each stateful operator partition, checkpointed each batch.
# Recovery: load latest snapshot + replay deltas to current batch.
# Realistic state size: ~100s GB per executor.

# Flink state backends
state.backend = rocksdb
state.backend.incremental = true
state.backend.changelog.enabled = true   # Flink 1.16+
# Async snapshots, only RocksDB SST diffs uploaded.
# Recovery from checkpoint is constant-time (lazy load).
# Realistic state size: 10s of TB per operator.

# Why the gap?
# - Flink built state-first; checkpoints are core, not retrofitted
# - Flink supports incremental checkpoints since 1.3 (2017)
# - Spark only got RocksDB state in 3.2 (2021); full incremental in 3.5
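The incremental-checkpoint idea behind both configs (upload only what changed since the last snapshot) can be sketched generically; dirty-key tracking below stands in for the SST-file diffs RocksDB backends actually upload, and all names are illustrative.

```python
class IncrementalCheckpointer:
    """Track dirty keys so each checkpoint uploads only the delta,
    analogous to RocksDB backends uploading only new SST files."""
    def __init__(self):
        self.state = {}
        self.dirty = set()
        self.uploaded = []           # durable deltas, oldest first

    def put(self, key, value):
        self.state[key] = value
        self.dirty.add(key)

    def checkpoint(self):
        delta = {k: self.state[k] for k in self.dirty}
        self.uploaded.append(delta)  # upload just the changed keys
        self.dirty.clear()
        return len(delta)

    def restore(self):
        """Rebuild state by replaying deltas in order."""
        restored = {}
        for delta in self.uploaded:
            restored.update(delta)
        return restored

ckpt = IncrementalCheckpointer()
for k in ("a", "b", "c"):
    ckpt.put(k, 1)
print(ckpt.checkpoint())   # first snapshot uploads everything
ckpt.put("b", 2)
print(ckpt.checkpoint())   # subsequent snapshot uploads one key
```

With multi-TB state, this delta-versus-full difference is what keeps checkpoint times sub-minute.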

When Each Wins

Spark wins for

  • Analytical batch SQL on data lakes (Delta/Iceberg)
  • ETL pipelines with Catalyst-friendly SQL
  • ML workflows (MLlib + integrations)
  • Streaming where seconds latency is fine and code reuse with batch matters
  • Teams already on Databricks / Spark cloud services

Flink wins for

  • Sub-100ms latency stream processing
  • Complex event-time semantics, custom windows
  • Multi-TB state, fast incremental checkpoints
  • Many non-Kafka sources/sinks (Kinesis, Pulsar, JDBC XA)
  • CEP (complex event processing) and stateful pattern matching
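The CEP bullet can be made concrete with a keyed pattern detector in plain Python (illustrative, not the Flink CEP API): per-key state tracks consecutive failures, and the pattern fires on a success that follows a threshold of failures.

```python
from collections import defaultdict

def detect_brute_force(events, threshold=3):
    """Flag users with `threshold` consecutive 'fail' events
    followed by a 'success' -- a classic CEP-style pattern."""
    fail_streak = defaultdict(int)   # keyed state: user -> consecutive fails
    alerts = []
    for user, outcome in events:
        if outcome == "fail":
            fail_streak[user] += 1
        else:  # success resolves or resets the pattern
            if fail_streak[user] >= threshold:
                alerts.append(user)
            fail_streak[user] = 0
    return alerts

events = [("alice", "fail"), ("alice", "fail"), ("bob", "fail"),
          ("alice", "fail"), ("alice", "success"), ("bob", "success")]
print(detect_brute_force(events))   # ['alice']
```

In production this per-key state is exactly what must survive failures, which is why CEP workloads lean on Flink's keyed state and checkpointing.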

Ecosystem Notes

# Spark
# Cloud-managed: Databricks, EMR, Dataproc, Synapse, Glue, Fabric
# Lakehouse: Delta Lake (Linux Foundation), Iceberg (Apache), Hudi (Apache)
# Notebook integration: Jupyter, Databricks, Zeppelin
# Languages: Scala, Java, Python (PySpark), R (SparkR), SQL
# Key vendors: Databricks, Cloudera

# Flink
# Cloud-managed: Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics), Ververica Cloud, Aliyun
# Companion: Flink CDC (Change Data Capture), Flink ML
# Languages: Java, Scala, Python (PyFlink), SQL
# Key vendors: Ververica (formerly data Artisans), Alibaba, Cloudera

# Beam (portability layer)
# Lets you write once, run on Spark, Flink, Dataflow, Samza
# Tradeoff: feature ceiling = lowest common denominator + translation overhead
# Used by Google Dataflow customers, less common elsewhere

Frequently Asked Questions

Why is Spark called 'batch-first' if it has Structured Streaming?

Spark started as a batch RDD engine; streaming was retrofitted on top as micro-batches. The execution model is fundamentally 'run a small batch every interval'. Flink, in contrast, was designed as a streaming dataflow runtime from day one — every operator processes records continuously, and batch is implemented as a special case of bounded streams. This shows up in latency (Flink is ~10x lower) and in features (Flink's event-time, watermarks, and savepoints are deeper).

Which has better state management?

Flink, by a wide margin. Flink's state backends (RocksDB with incremental checkpoints, ChangelogStateBackend) routinely handle multi-TB state per operator with sub-minute checkpoints. Spark Structured Streaming's state stores are more limited — RocksDB support arrived in 3.2 and is still maturing. For state-heavy applications (stream joins with weeks of retention, sessionization at scale), Flink is the standard choice.

Which is better for batch?

Spark, decisively. Catalyst + Tungsten + AQE form the most polished batch-SQL stack in the open-source ecosystem, and the surrounding tooling (Delta Lake, Iceberg, Hudi, Spark UI, broad cloud support) makes Spark the default for analytical batch processing. Flink supports batch via the same DataStream/Table APIs, but the ecosystem and tooling are thinner, and the query optimizer is less mature than Catalyst.

Can the same code run on Spark and Flink?

Largely no. Spark uses Datasets/DataFrames; Flink uses DataStream/DataSet (now unified into DataStream + Table API). The Apache Beam project provides a portable API, and your job can run on either Spark or Flink as a runner — but Beam adds a translation layer with its own performance characteristics, and most production users pick a runtime and stick with it.

Which has better Kubernetes support?

Both are mature on K8s. Flink has a native operator (Flink Kubernetes Operator) and was designed with long-running jobs in mind, so its K8s integration handles streaming-specific concerns like checkpoint storage and HA out of the box. Spark on K8s is also production-ready (3.1+) but is more oriented toward batch-style job submission. For long-running streaming jobs on K8s, Flink's lifecycle is slightly cleaner; for ad-hoc batch jobs, Spark's pattern is simpler.