Spark on Kubernetes
Driver Pod, Executor Pods, and the Shuffle Problem
Spark on Kubernetes treats the K8s API as the resource manager. The driver runs as a pod; the driver requests executor pods directly from the K8s API; the K8s scheduler places them. There is no separate Hadoop cluster, no Mesos, no YARN. The model is much closer to how modern microservice infrastructure actually runs in production.
The main hard problem is shuffle. K8s pods are designed to be ephemeral, but Spark shuffle writes gigabytes of intermediate data that must survive long enough for the consuming stage to read it. Solving this — via external shuffle services, persistent volumes, or remote shuffle systems — is most of what makes Spark on K8s production-ready.
spark-submit on Kubernetes
# Submit directly to the K8s API
$ spark-submit \
  --master k8s://https://kubernetes.default.svc \
  --deploy-mode cluster \
  --name etl-job \
  --class com.example.ETL \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.container.image=myrepo/spark:3.5.0 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
  --conf spark.executor.instances=10 \
  --conf spark.executor.cores=4 \
  --conf spark.executor.memory=8g \
  --conf spark.driver.cores=2 \
  --conf spark.driver.memory=4g \
  local:///opt/jobs/etl-1.0.jar
# What happens:
# 1. spark-submit creates a Driver pod manifest and POSTs to API server
# 2. Driver pod starts, runs your main(), creates SparkContext
# 3. SparkContext (in cluster mode) talks to K8s API to create executor pods
# 4. Executors register back with the driver; tasks start
# 5. On shutdown, the driver deletes its executor pods and exits
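While a job runs, the same lifecycle is visible from the Kubernetes side; a quick sketch, assuming the spark namespace from the submit above and the spark-role labels Spark puts on its pods:
# Driver pod created by spark-submit
$ kubectl get pods -n spark -l spark-role=driver
# Executor pods requested by the driver
$ kubectl get pods -n spark -l spark-role=executor
# Follow the driver output
$ kubectl logs -n spark -l spark-role=driver -f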
spark-operator (Kubernetes-native)
# Install the operator (Helm)
$ helm repo add spark-operator https://kubeflow.github.io/spark-operator
$ helm install spark-op spark-operator/spark-operator -n spark-system
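A quick sanity check that the operator is up (namespace as installed above) before submitting anything:
$ kubectl get pods -n spark-system
$ kubectl get crd sparkapplications.sparkoperator.k8s.io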
# Submit a job as a CRD
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: etl-job
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: myrepo/spark:3.5.0
  mainClass: com.example.ETL
  mainApplicationFile: local:///opt/jobs/etl-1.0.jar
  sparkVersion: "3.5.0"
  driver:
    cores: 2
    memory: 4g
    serviceAccount: spark-sa
  executor:
    cores: 4
    instances: 10
    memory: 8g
  dynamicAllocation:
    enabled: true
    minExecutors: 2
    maxExecutors: 50
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
The operator pattern fits Kubernetes idioms: declarative spec, controller reconciles, kubectl tools work, GitOps friendly. Failures are handled by the operator (driver pod restart, retry). For production pipelines, this is almost always the right choice over raw spark-submit.
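Day-to-day interaction is then plain kubectl; a short sketch, assuming the manifest above is saved as etl-job.yaml and the operator's usual <name>-driver pod naming:
$ kubectl apply -f etl-job.yaml
# Application status (the operator reports states such as SUBMITTED, RUNNING, COMPLETED)
$ kubectl get sparkapplication etl-job -n spark
# Driver logs
$ kubectl logs -n spark etl-job-driver -f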
Shuffle Persistence Options
# Option 1: PVC per executor (recommended starting point)
spark.kubernetes.executor.volumes.persistentVolumeClaim.shuffle.options.claimName=OnDemand
spark.kubernetes.executor.volumes.persistentVolumeClaim.shuffle.options.storageClass=fast-ssd
spark.kubernetes.executor.volumes.persistentVolumeClaim.shuffle.options.sizeLimit=200Gi
spark.kubernetes.executor.volumes.persistentVolumeClaim.shuffle.mount.path=/shuffle
spark.local.dir=/shuffle
# Each executor gets its own dynamically provisioned PVC.
# When an executor pod dies, Spark can hand its PVC to the replacement executor
# (with the reuse flags sketched below) instead of provisioning a new one.
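A couple of extra flags control whether those on-demand PVCs are actually reused after an executor is lost; a minimal sketch, assuming Spark 3.2+ for the reuse settings and 3.4+ for the shuffle-recovery plugin:
# Driver creates, owns, and re-uses the on-demand PVCs
spark.kubernetes.driver.ownPersistentVolumeClaims=true
spark.kubernetes.driver.reusePersistentVolumeClaims=true
# Optional: let a replacement executor register shuffle files already present on the reused volume
spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO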
# Option 2: External shuffle service as a node-level DaemonSet
# Run a shuffle service pod on each node; executors write shuffle files to host-local storage
# Shuffle data lives on the node, not in the executor pod, so it survives executor loss
# Not a first-class upstream feature on Kubernetes; production use is moving toward Celeborn
# Option 3: Apache Celeborn (push-based remote shuffle)
spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager
spark.celeborn.master.endpoints=celeborn-master:9097
# Shuffle data in a separate cluster of Celeborn workers
# Executors are stateless; can be killed at any time
Dynamic Allocation
spark.dynamicAllocation.enabled = true
spark.dynamicAllocation.minExecutors = 2
spark.dynamicAllocation.maxExecutors = 100
spark.dynamicAllocation.executorIdleTimeout = 60s
spark.dynamicAllocation.shuffleTracking.enabled = true
spark.dynamicAllocation.shuffleTracking.timeout = 7200s
# How it works without an external shuffle service:
# 1. Idle executor with no shuffle data: removed after executorIdleTimeout
# 2. Idle executor that holds shuffle data needed by future stages:
# kept alive (tracked) until shuffle is no longer needed
# 3. Driver continually adjusts executor count based on backlog and finished stages
# Common gotcha: executors with shuffle responsibility look "idle" but cannot
# be killed. spark.dynamicAllocation.shuffleTracking.timeout caps this so
# truly stuck executors get released eventually.
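The scaling behaviour is easy to observe directly, again assuming the standard spark-role label on executor pods:
# Watch executor pods come and go as dynamic allocation scales the job
$ kubectl get pods -n spark -l spark-role=executor -w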
Spark on K8s vs YARN
| Aspect | YARN | Kubernetes |
|---|---|---|
| Resource manager | ResourceManager + NodeManagers | K8s API + kubelet |
| Multi-tenancy | Queues, capacity scheduler | Namespaces + ResourceQuotas + scheduler plugins |
| Locality | HDFS-aware, prefers local data | None native (object storage assumed) |
| Shuffle | External shuffle service in NodeManager | PVC, DaemonSet shuffle service, or remote (Celeborn) |
| Container image | Tarball/archive environments, often with everything baked in | OCI image: clean, versioned, pullable |
| Autoscaling | Per-cluster only | Per-job + cluster autoscaler + spot/preemptible |
| Operational tooling | Hadoop ecosystem | kubectl, Helm, Argo CD, Flux, mature observability |
Tradeoffs and When Not to Use
K8s wins when
- You already run microservices on K8s (one platform to operate)
- Job mix is bursty — autoscaling pays off
- You can use spot/preemptible nodes safely (see the pod template sketch after this list)
- Object storage (S3, GCS) is your data lake
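For the spot/preemptible point above, a common pattern is an executor pod template that tolerates the spot taint; a sketch, where the taint key and value are placeholders for whatever your node pools actually use:
# executor-pod-template.yaml (taint key/value are illustrative placeholders)
apiVersion: v1
kind: Pod
spec:
  tolerations:
    - key: node-lifecycle
      operator: Equal
      value: spot
      effect: NoSchedule
# Point Spark at the template (path must be readable by spark-submit)
spark.kubernetes.executor.podTemplateFile=/opt/spark/conf/executor-pod-template.yaml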
Stick with YARN when
- You have a working Hadoop cluster with HDFS data locality
- Your team has zero K8s operational experience
- You need very large shared shuffles without remote shuffle infrastructure
- Tight integration with existing Hive/Tez workloads
Frequently Asked Questions
Do I still need YARN if I run Spark on Kubernetes?
No, K8s replaces YARN as the resource manager. Spark on Kubernetes mode launches the driver as a pod, the driver requests executor pods directly from the K8s API, and pods are scheduled by the K8s scheduler. No Hadoop or YARN cluster needed. The tradeoff is that K8s does not give you the same data-locality scheduling YARN had with HDFS — though that is rarely the bottleneck on modern object-storage-based deployments.
What happens to shuffle data when an executor pod dies?
By default, shuffle files are written to the executor pod's ephemeral storage. When the pod dies (preemption, autoscaling), the shuffle data is gone and Spark must recompute the lost stage. Solutions: 1) Run an external shuffle service at the node level (e.g. as a DaemonSet) so shuffle outlives executors. 2) Use a persistent volume (PVC) per executor that survives pod replacement. 3) Move shuffle off the executors entirely with a remote shuffle system such as Apache Celeborn, or a cloud object-storage shuffle plugin. The PVC approach is the most production-tested for K8s.
What is the spark-operator and when should I use it?
spark-operator is a Kubernetes operator that lets you submit Spark jobs as a SparkApplication CRD. Instead of `spark-submit`, you `kubectl apply -f my-job.yaml`. The operator handles driver pod creation, retries, monitoring, and integration with K8s tooling (kubectl, Argo, Flux). For ad-hoc jobs from a developer machine, spark-submit is fine. For production pipelines, the operator gives you GitOps, declarative configs, and proper K8s-native lifecycle management.
Does dynamic allocation work on Kubernetes?
Yes, since Spark 3.0. Set spark.dynamicAllocation.enabled=true and spark.dynamicAllocation.shuffleTracking.enabled=true. Without an external shuffle service, Spark on K8s uses 'shuffle tracking' — it keeps an executor alive as long as it holds shuffle data needed by future stages, even if it is otherwise idle. Idle executors with no shuffle responsibilities are scaled down. This works well for ETL with multiple stages of varying parallelism.
How does Spark on K8s compare to YARN for cost?
On the cloud, K8s usually wins through bin-packing. YARN typically gives you static cluster sizing — you pay for the peak. K8s with cluster autoscaler and spot instances scales pods up and down per job, often achieving 30-60% lower compute spend. On-prem, YARN may still be cheaper if you already have a Hadoop cluster, but the operational overhead is moving people to K8s anyway.