Spark on Kubernetes

Driver Pod, Executor Pods, and the Shuffle Problem

Spark on Kubernetes treats the K8s API as the resource manager. The driver runs as a pod; the driver requests executor pods directly from the K8s API; the K8s scheduler places them. There is no separate Hadoop cluster, no Mesos, no YARN. The model is much closer to how modern microservice infrastructure actually runs in production.

The main hard problem is shuffle. K8s pods are designed to be ephemeral, but Spark shuffle writes gigabytes of intermediate data that must survive long enough for the consuming stage to read it. Solving this — via external shuffle services, persistent volumes, or remote shuffle systems — is most of what makes Spark on K8s production-ready.

Architecture

  • Driver pod (spark-driver-xyz): runs the SparkContext and DAG scheduler; requests executor pods from the K8s API
  • K8s API: creates pods, configmaps, and services
  • Executor pods 1..N: e.g. 4 cores and 8 GiB memory each, with a PVC for shuffle
  • Shuffle layer (one of): a PVC per executor (survives pod restart); an External Shuffle Service DaemonSet (decoupled from executors); or remote shuffle (Celeborn / S3) as a scale-out shuffle store
  • K8s scheduler + Cluster Autoscaler: scales nodes up/down based on pending pods; supports spot/preemptible nodes
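In cluster mode the driver itself assembles an executor pod spec and POSTs it to the API server. A stripped-down sketch of such a manifest follows — the `spark-role` and `spark-app-selector` labels are ones Spark really sets; everything else is simplified and illustrative:

```python
def executor_pod_manifest(app_id, exec_id, image, cores, memory_mib):
    """Illustrative executor pod spec, shaped like what the driver submits."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"{app_id}-exec-{exec_id}",
            # Labels the driver uses to find and manage its executors
            "labels": {"spark-app-selector": app_id, "spark-role": "executor"},
        },
        "spec": {
            # The driver, not K8s, replaces lost executors
            "restartPolicy": "Never",
            "containers": [{
                "name": "spark-executor",
                "image": image,
                "resources": {
                    "requests": {"cpu": str(cores), "memory": f"{memory_mib}Mi"}
                },
            }],
        },
    }

m = executor_pod_manifest("spark-abc123", 1, "myrepo/spark:3.5.0", 4, 9011)
print(m["metadata"]["name"])  # spark-abc123-exec-1
```

Because executors are plain pods, everything K8s knows how to do with pods — quotas, affinity, taints, priority classes — applies to Spark for free.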

Key Numbers

  • 2.3: Spark version that added native K8s support
  • 3.1: Spark version where K8s support went GA
  • 1 driver + N executors per Spark app
  • 30-60%: typical cost savings vs a static YARN cluster
  • ~5s: pod startup time (image cached)
  • ~30s: pod startup time (cold image pull)
  • 3.4+: Spark versions with shuffle tracking improvements

spark-submit on Kubernetes

# Submit directly to the K8s API
$ spark-submit \
    --master k8s://https://kubernetes.default.svc \
    --deploy-mode cluster \
    --name etl-job \
    --class com.example.ETL \
    --conf spark.kubernetes.namespace=spark \
    --conf spark.kubernetes.container.image=myrepo/spark:3.5.0 \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
    --conf spark.executor.instances=10 \
    --conf spark.executor.cores=4 \
    --conf spark.executor.memory=8g \
    --conf spark.driver.cores=2 \
    --conf spark.driver.memory=4g \
    local:///opt/jobs/etl-1.0.jar

# What happens:
# 1. spark-submit creates a Driver pod manifest and POSTs to API server
# 2. Driver pod starts, runs your main(), creates SparkContext
# 3. SparkContext (in cluster mode) talks to K8s API to create executor pods
# 4. Executors register back with the driver; tasks start
# 5. On shutdown, driver deletes executor pods and exits
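One sizing detail the submit command hides: each executor pod's memory request is larger than spark.executor.memory, because Spark adds non-heap overhead — by default max(10% of the heap, 384 MiB) for JVM workloads, tunable via spark.executor.memoryOverhead and spark.kubernetes.memoryOverheadFactor. A minimal sketch of the arithmetic:

```python
# Sketch of how the executor pod's memory request is derived on K8s,
# assuming the default JVM rule: heap + max(overhead_factor * heap, 384 MiB).
MIN_OVERHEAD_MIB = 384

def pod_memory_request_mib(executor_memory_mib: int,
                           overhead_factor: float = 0.10) -> int:
    overhead = max(int(executor_memory_mib * overhead_factor), MIN_OVERHEAD_MIB)
    return executor_memory_mib + overhead

# The 8g executors above actually request about 9011 MiB from the scheduler:
print(pod_memory_request_mib(8192))  # 8192 + 819 = 9011
```

Undersizing this overhead is a classic cause of OOMKilled executor pods, since the kernel enforces the pod limit, not the JVM heap.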

spark-operator (Kubernetes-native)

# Install the operator (Helm)
$ helm repo add spark-operator https://kubeflow.github.io/spark-operator
$ helm install spark-op spark-operator/spark-operator -n spark-system

# Submit a job as a CRD
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: etl-job
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: myrepo/spark:3.5.0
  mainClass: com.example.ETL
  mainApplicationFile: local:///opt/jobs/etl-1.0.jar
  sparkVersion: 3.5.0
  driver:
    cores: 2
    memory: 4g
    serviceAccount: spark-sa
  executor:
    cores: 4
    instances: 10
    memory: 8g
  dynamicAllocation:
    enabled: true
    minExecutors: 2
    maxExecutors: 50
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3

The operator pattern fits Kubernetes idioms: a declarative spec that a controller reconciles, standard kubectl tooling, GitOps-friendly workflows. The operator handles failures for you (driver pod restart, retries). For production pipelines, it is almost always the right choice over raw spark-submit.
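The restartPolicy in the CRD above can be read as a small decision rule. A simplified model (not the operator's actual code) of how `type` and `onFailureRetries` interact:

```python
# Simplified model of the operator's restart decision for a SparkApplication
# with restartPolicy {type: OnFailure, onFailureRetries: 3}.
def should_retry(state: str, failure_count: int,
                 policy: str = "OnFailure", max_retries: int = 3) -> bool:
    if policy == "Never":
        return False
    if policy == "Always":
        return True
    # OnFailure: re-run failed applications up to onFailureRetries times
    return state == "FAILED" and failure_count < max_retries

print(should_retry("FAILED", 0))   # True: first failure is retried
print(should_retry("FAILED", 3))   # False: retry budget exhausted
```

With raw spark-submit, this retry loop is something you would have to build yourself in a scheduler like Airflow; with the operator it is declared once in the spec.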

Shuffle Persistence Options

# Option 1: PVC per executor (recommended starting point)
spark.kubernetes.executor.volumes.persistentVolumeClaim.shuffle.options.claimName=OnDemand
spark.kubernetes.executor.volumes.persistentVolumeClaim.shuffle.options.storageClass=fast-ssd
spark.kubernetes.executor.volumes.persistentVolumeClaim.shuffle.options.sizeLimit=200Gi
spark.kubernetes.executor.volumes.persistentVolumeClaim.shuffle.mount.path=/shuffle
spark.local.dir=/shuffle

# Each executor gets its own PVC, dynamically provisioned (claimName=OnDemand).
# With spark.kubernetes.driver.ownPersistentVolumeClaim=true and
# spark.kubernetes.driver.reusePersistentVolumeClaim=true (Spark 3.2+), a
# replacement executor can reattach a dead executor's PVC and reuse its
# shuffle data instead of recomputing it.

# Option 2: External shuffle service as a DaemonSet
# One shuffle-service pod per node; executors write shuffle files to a
# volume shared with the service, which serves them to readers.
# Shuffle data lives on the host, so it outlives the executor pod.
# The upstream proposal for this never shipped in Spark itself;
# production deployments are moving toward remote shuffle (Celeborn).

# Option 3: Apache Celeborn (push-based remote shuffle)
spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager
spark.celeborn.master.endpoints=celeborn-master:9097
# Shuffle data in a separate cluster of Celeborn workers
# Executors are stateless; can be killed any time

Dynamic Allocation

spark.dynamicAllocation.enabled = true
spark.dynamicAllocation.minExecutors = 2
spark.dynamicAllocation.maxExecutors = 100
spark.dynamicAllocation.executorIdleTimeout = 60s
spark.dynamicAllocation.shuffleTracking.enabled = true
spark.dynamicAllocation.shuffleTracking.timeout = 7200s

# How it works without an external shuffle service:
# 1. Idle executor with no shuffle data: removed after executorIdleTimeout
# 2. Idle executor that holds shuffle data needed by future stages:
#    kept alive (tracked) until shuffle is no longer needed
# 3. Driver continually adjusts executor count based on backlog and finished stages

# Common gotcha: executors with shuffle responsibility look "idle" but cannot
# be killed. spark.dynamicAllocation.shuffleTracking.timeout caps this so
# truly stuck executors get released eventually.
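The removal rules above can be condensed into one predicate. A simplified model (not Spark's actual code) with the timeouts from the conf block:

```python
# Simplified model of the executor-removal decision under dynamic allocation
# with shuffle tracking enabled (no external shuffle service).
def can_remove(idle_s: float, holds_needed_shuffle: bool,
               idle_timeout_s: float = 60,
               shuffle_timeout_s: float = 7200) -> bool:
    if idle_s < idle_timeout_s:
        return False                        # inside the idle grace period
    if holds_needed_shuffle:
        return idle_s >= shuffle_timeout_s  # tracked until the cap expires
    return True                             # idle and stateless: removable

print(can_remove(90, holds_needed_shuffle=False))  # True
print(can_remove(90, holds_needed_shuffle=True))   # False: shuffle pins it
```

This is why a cluster can look over-provisioned mid-job: executors that finished their tasks are pinned by shuffle data until the consuming stages complete or the tracking timeout fires.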

Spark on K8s vs YARN

Aspect              | YARN                                     | Kubernetes
Resource manager    | ResourceManager + NodeManagers           | K8s API + kubelet
Multi-tenancy       | Queues, capacity scheduler               | Namespaces + ResourceQuotas + scheduler plugins
Locality            | HDFS-aware, prefers local data           | None native (object storage assumed)
Shuffle             | External shuffle service in NodeManager  | PVC, DaemonSet shuffle service, or remote (Celeborn)
Container image     | Tar-based environment, often bake everything in | OCI image: clean, versioned, pullable
Autoscaling         | Per-cluster only                         | Per-job + cluster autoscaler + spot/preemptible
Operational tooling | Hadoop ecosystem                         | kubectl, Helm, Argo CD, Flux, mature observability

Tradeoffs and When Not to Use

K8s wins when

  • You already run microservices on K8s (one platform to operate)
  • Job mix is bursty — autoscaling pays off
  • You can use spot/preemptible nodes safely
  • Object storage (S3, GCS) is your data lake

Stick with YARN when

  • You have a working Hadoop cluster with HDFS data locality
  • Your team has zero K8s operational experience
  • You need very large shared shuffles without remote shuffle infrastructure
  • Tight integration with existing Hive/Tez workloads

Frequently Asked Questions

Do I still need YARN if I run Spark on Kubernetes?

No, K8s replaces YARN as the resource manager. Spark on Kubernetes mode launches the driver as a pod, the driver requests executor pods directly from the K8s API, and pods are scheduled by the K8s scheduler. No Hadoop or YARN cluster needed. The tradeoff is that K8s does not give you the same data-locality scheduling YARN had with HDFS — though that is rarely the bottleneck on modern object-storage-based deployments.

What happens to shuffle data when an executor pod dies?

By default, shuffle files are written to ephemeral pod storage. When the pod dies (preemption, autoscaling), the shuffle data is gone and Spark must recompute the lost stage. Solutions: 1) Run an external shuffle service as a DaemonSet so shuffle outlives executors. 2) Use a persistent volume (PVC) per executor that survives pod restart. 3) Use a remote shuffle system (e.g. Apache Celeborn) or a cloud-storage-backed shuffle plugin. The PVC approach is the most production-tested starting point on K8s.

What is the spark-operator and when should I use it?

spark-operator is a Kubernetes operator that lets you submit Spark jobs as a SparkApplication CRD. Instead of `spark-submit`, you `kubectl apply -f my-job.yaml`. The operator handles driver pod creation, retries, monitoring, and integration with K8s tooling (kubectl, Argo, Flux). For ad-hoc jobs from a developer machine, spark-submit is fine. For production pipelines, the operator gives you GitOps, declarative configs, and proper K8s-native lifecycle management.

Does dynamic allocation work on Kubernetes?

Yes, since Spark 3.0. Set spark.dynamicAllocation.enabled=true and spark.dynamicAllocation.shuffleTracking.enabled=true. Without an external shuffle service, Spark on K8s uses 'shuffle tracking' — it keeps an executor alive as long as it holds shuffle data needed by future stages, even if it is otherwise idle. Idle executors with no shuffle responsibilities are scaled down. This works well for ETL with multiple stages of varying parallelism.

How does Spark on K8s compare to YARN for cost?

On the cloud, K8s usually wins through bin-packing. YARN typically gives you static cluster sizing — you pay for the peak. K8s with cluster autoscaler and spot instances scales pods up and down per job, often achieving 30-60% lower compute spend. On-prem, YARN may still be cheaper if you already have a Hadoop cluster, but the operational overhead is moving people to K8s anyway.
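To make the bin-packing argument concrete, a back-of-envelope comparison — every number here is hypothetical, chosen only to illustrate the shape of the calculation:

```python
# Back-of-envelope version of the savings claim (all figures hypothetical).
rate = 0.40                            # $/executor-hour, on-demand
static_cost = 100 * 24 * rate          # cluster sized for peak, running 24h/day
auto_cost = 50 * 24 * rate * 0.9       # avg 50 executors, blended spot discount
savings = 1 - auto_cost / static_cost
print(f"{savings:.0%}")                # 55% with these made-up numbers
```

The two levers are independent: autoscaling shrinks the executor-hours term, spot/preemptible nodes shrink the rate term, and the savings compound.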