Spark on Kubernetes
Driver Pod, Executor Pods, and the Shuffle Problem
Spark on Kubernetes treats the K8s API as the resource manager. The driver runs as a pod; the driver requests executor pods directly from the K8s API; the K8s scheduler places them. There is no separate Hadoop cluster, no Mesos, no YARN. The model is much closer to how modern microservice infrastructure actually runs in production.
The main hard problem is shuffle. K8s pods are designed to be ephemeral, but Spark shuffle writes gigabytes of intermediate data that must survive long enough for the consuming stage to read it. Solving this — via external shuffle services, persistent volumes, or remote shuffle systems — is most of what makes Spark on K8s production-ready.
spark-submit on Kubernetes
# Submit directly to the K8s API
$ spark-submit \
  --master k8s://https://kubernetes.default.svc \
  --deploy-mode cluster \
  --name etl-job \
  --class com.example.ETL \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.container.image=myrepo/spark:3.5.0 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
  --conf spark.executor.instances=10 \
  --conf spark.executor.cores=4 \
  --conf spark.executor.memory=8g \
  --conf spark.driver.cores=2 \
  --conf spark.driver.memory=4g \
  local:///opt/jobs/etl-1.0.jar
# What happens:
# 1. spark-submit creates a Driver pod manifest and POSTs to API server
# 2. Driver pod starts, runs your main(), creates SparkContext
# 3. SparkContext (in cluster mode) talks to K8s API to create executor pods
# 4. Executors register back with the driver; tasks start
# 5. On shutdown, the driver deletes its executor pods and exits
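While a job runs, the same lifecycle is visible from the Kubernetes side; a quick sketch, assuming the spark namespace from the submit above and the spark-role labels Spark puts on its pods:
# Driver pod created by spark-submit
$ kubectl get pods -n spark -l spark-role=driver
# Executor pods requested by the driver
$ kubectl get pods -n spark -l spark-role=executor
# Follow the driver output
$ kubectl logs -n spark -l spark-role=driver -f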
spark-operator (Kubernetes-native)
# Install the operator (Helm)
$ helm repo add spark-operator https://kubeflow.github.io/spark-operator
$ helm install spark-op spark-operator/spark-operator -n spark-system
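A quick sanity check that the operator is up (namespace as installed above) before submitting anything:
$ kubectl get pods -n spark-system
$ kubectl get crd sparkapplications.sparkoperator.k8s.io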
# Submit a job as a CRD
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: etl-job
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: myrepo/spark:3.5.0
  mainClass: com.example.ETL
  mainApplicationFile: local:///opt/jobs/etl-1.0.jar
  sparkVersion: "3.5.0"
  driver:
    cores: 2
    memory: 4g
    serviceAccount: spark-sa
  executor:
    cores: 4
    instances: 10
    memory: 8g
  dynamicAllocation:
    enabled: true
    minExecutors: 2
    maxExecutors: 50
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
The operator pattern fits Kubernetes idioms: declarative spec, controller reconciles, kubectl tools work, GitOps friendly. Failures are handled by the operator (driver pod restart, retry). For production pipelines, this is almost always the right choice over raw spark-submit.
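Day-to-day interaction is then plain kubectl; a short sketch, assuming the manifest above is saved as etl-job.yaml and the operator's usual <name>-driver pod naming:
$ kubectl apply -f etl-job.yaml
# Application status (the operator reports states such as SUBMITTED, RUNNING, COMPLETED)
$ kubectl get sparkapplication etl-job -n spark
# Driver logs
$ kubectl logs -n spark etl-job-driver -f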
Shuffle Persistence Options
# Option 1: PVC per executor (recommended starting point)
spark.kubernetes.executor.volumes.persistentVolumeClaim.shuffle.options.claimName=OnDemand
spark.kubernetes.executor.volumes.persistentVolumeClaim.shuffle.options.storageClass=fast-ssd
spark.kubernetes.executor.volumes.persistentVolumeClaim.shuffle.options.sizeLimit=200Gi
spark.kubernetes.executor.volumes.persistentVolumeClaim.shuffle.mount.path=/shuffle
spark.local.dir=/shuffle
# Each executor gets its own dynamically provisioned PVC.
# When an executor pod dies, Spark can hand its PVC to the replacement executor
# (with the reuse flags sketched below) instead of provisioning a new one.
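A couple of extra flags control whether those on-demand PVCs are actually reused after an executor is lost; a minimal sketch, assuming Spark 3.2+ for the reuse settings and 3.4+ for the shuffle-recovery plugin:
# Driver creates, owns, and re-uses the on-demand PVCs
spark.kubernetes.driver.ownPersistentVolumeClaims=true
spark.kubernetes.driver.reusePersistentVolumeClaims=true
# Optional: let a replacement executor register shuffle files already present on the reused volume
spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO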
# Option 2: External shuffle service as a node-level DaemonSet
# Run a shuffle service pod on each node; executors write shuffle files to host-local storage
# Shuffle data lives on the node, not in the executor pod, so it survives executor loss
# Not a first-class upstream feature on Kubernetes; production use is moving toward Celeborn
# Option 3: Apache Celeborn (push-based remote shuffle)
spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager
spark.celeborn.master.endpoints=celeborn-master:9097
# Shuffle data in a separate cluster of Celeborn workers
# Executors are stateless; can be killed at any time
Dynamic Allocation
spark.dynamicAllocation.enabled = true
spark.dynamicAllocation.minExecutors = 2
spark.dynamicAllocation.maxExecutors = 100
spark.dynamicAllocation.executorIdleTimeout = 60s
spark.dynamicAllocation.shuffleTracking.enabled = true
spark.dynamicAllocation.shuffleTracking.timeout = 7200s
# How it works without an external shuffle service:
# 1. Idle executor with no shuffle data: removed after executorIdleTimeout
# 2. Idle executor that holds shuffle data needed by future stages:
# kept alive (tracked) until shuffle is no longer needed
# 3. Driver continually adjusts executor count based on backlog and finished stages
# Common gotcha: executors with shuffle responsibility look "idle" but cannot
# be killed. spark.dynamicAllocation.shuffleTracking.timeout caps this so
# truly stuck executors get released eventually.
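The scaling behaviour is easy to observe directly, again assuming the standard spark-role label on executor pods:
# Watch executor pods come and go as dynamic allocation scales the job
$ kubectl get pods -n spark -l spark-role=executor -w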
Spark on K8s vs YARN
| Aspect | YARN | Kubernetes |
|---|---|---|
| Resource manager | ResourceManager + NodeManagers | K8s API + kubelet |
| Multi-tenancy | Queues, capacity scheduler | Namespaces + ResourceQuotas + scheduler plugins |
| Locality | HDFS-aware, prefers local data | None native (object storage assumed) |
| Shuffle | External shuffle service in NodeManager | PVC, DaemonSet shuffle service, or remote (Celeborn) |
| Container image | Tarball/archive environments, often with everything baked in | OCI image: clean, versioned, pullable |
| Autoscaling | Per-cluster only | Per-job + cluster autoscaler + spot/preemptible |
| Operational tooling | Hadoop ecosystem | kubectl, Helm, Argo CD, Flux, mature observability |
Tradeoffs and When Not to Use
K8s wins when
- You already run microservices on K8s (one platform to operate)
- Job mix is bursty — autoscaling pays off
- You can use spot/preemptible nodes safely (see the pod template sketch after this list)
- Object storage (S3, GCS) is your data lake
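For the spot/preemptible point above, a common pattern is an executor pod template that tolerates the spot taint; a sketch, where the taint key and value are placeholders for whatever your node pools actually use:
# executor-pod-template.yaml (taint key/value are illustrative placeholders)
apiVersion: v1
kind: Pod
spec:
  tolerations:
    - key: node-lifecycle
      operator: Equal
      value: spot
      effect: NoSchedule
# Point Spark at the template (path must be readable by spark-submit)
spark.kubernetes.executor.podTemplateFile=/opt/spark/conf/executor-pod-template.yaml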
Stick with YARN when
- You have a working Hadoop cluster with HDFS data locality
- Your team has zero K8s operational experience
- You need very large shared shuffles without remote shuffle infrastructure
- Tight integration with existing Hive/Tez workloads
Frequently Asked Questions
Do I still need YARN if I run Spark on Kubernetes?
No, K8s replaces YARN as the resource manager. Spark on Kubernetes mode launches the driver as a pod, the driver requests executor pods directly from the K8s API, and pods are scheduled by the K8s scheduler. No Hadoop or YARN cluster needed. The tradeoff is that K8s does not give you the same data-locality scheduling YARN had with HDFS — though that is rarely the bottleneck on modern object-storage-based deployments.
What happens to shuffle data when an executor pod dies?
By default, shuffle files are written to the executor pod's ephemeral storage. When the pod dies (preemption, autoscaling), the shuffle data is gone and Spark must recompute the lost stage. Solutions: 1) Run an external shuffle service at the node level (e.g. as a DaemonSet) so shuffle outlives executors. 2) Use a persistent volume (PVC) per executor that survives pod replacement. 3) Move shuffle off the executors entirely with a remote shuffle system such as Apache Celeborn, or a cloud object-storage shuffle plugin. The PVC approach is the most production-tested for K8s.
What is the spark-operator and when should I use it?
spark-operator is a Kubernetes operator that lets you submit Spark jobs as a SparkApplication CRD. Instead of `spark-submit`, you `kubectl apply -f my-job.yaml`. The operator handles driver pod creation, retries, monitoring, and integration with K8s tooling (kubectl, Argo, Flux). For ad-hoc jobs from a developer machine, spark-submit is fine. For production pipelines, the operator gives you GitOps, declarative configs, and proper K8s-native lifecycle management.
Does dynamic allocation work on Kubernetes?
Yes, since Spark 3.0. Set spark.dynamicAllocation.enabled=true and spark.dynamicAllocation.shuffleTracking.enabled=true. Without an external shuffle service, Spark on K8s uses 'shuffle tracking' — it keeps an executor alive as long as it holds shuffle data needed by future stages, even if it is otherwise idle. Idle executors with no shuffle responsibilities are scaled down. This works well for ETL with multiple stages of varying parallelism.
How does Spark on K8s compare to YARN for cost?
On the cloud, K8s usually wins through bin-packing. YARN typically gives you static cluster sizing — you pay for the peak. K8s with cluster autoscaler and spot instances scales pods up and down per job, often achieving 30-60% lower compute spend. On-prem, YARN may still be cheaper if you already have a Hadoop cluster, but the operational overhead is moving people to K8s anyway.