⚡ DAG Execution
From Code to Cluster — How Spark Plans and Executes Your Job
When you call an action (collect, save, count), Spark builds a DAG (Directed Acyclic Graph) of transformations, splits it into stages at shuffle boundaries, and executes each stage as parallel tasks — one per partition. Understanding this pipeline is key to writing fast Spark jobs.
🔄 Execution Pipeline
(spark.read.parquet("orders")           # Read: 4 partitions
    .filter(col("amount") > 100)        # Narrow transform
    .groupBy("region")                  # Wide transform (SHUFFLE)
    .agg(sum("amount").alias("total"))  # Aggregation
    .orderBy(desc("total"))             # Wide transform (SHUFFLE)
    .collect())                         # ACTION → triggers execution
📊 Stage Execution: Tasks in Parallel
Each stage runs its tasks in parallel — one per partition. The number of tasks running at once is capped by the total executor cores, so adding executors (or cores) raises concurrency.
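The "one task per partition" model can be sketched in plain Python (a conceptual analogy, not Spark's scheduler): each partition is processed independently, and `max_workers` plays the role of available executor cores.

```python
from concurrent.futures import ThreadPoolExecutor

# Four "partitions" of an imaginary orders dataset.
partitions = [[120, 80], [200, 50], [300, 10], [150, 90]]

def run_task(partition):
    # A narrow transform (filter) runs entirely within one partition.
    return [amount for amount in partition if amount > 100]

# One task per partition; at most max_workers run concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(run_task, partitions))

print(results)  # [[120], [200], [300], [150]]
```

With 4 partitions and 2 workers, the stage takes two "waves" of tasks — exactly why partition count and executor cores should be balanced.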
🔀 Narrow vs Wide Transforms
Narrow (map, filter): each output partition depends on one input partition. No data movement. Wide (groupBy, join, orderBy): requires shuffle — redistributing data across the network. Shuffles are stage boundaries and the #1 performance bottleneck.
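The dependency difference can be illustrated in plain Python (a conceptual sketch, not Spark internals): a narrow transform touches each partition in isolation, while a per-key aggregation needs rows for a key from every partition — which is why it forces a shuffle.

```python
from collections import defaultdict

partitions = [[("west", 120), ("east", 200)],
              [("west", 300), ("east", 150)]]

# Narrow: output partition i depends only on input partition i.
filtered = [[(k, v) for k, v in part if v > 100] for part in partitions]

# Wide: the "west" total needs rows from BOTH partitions, so rows must
# first be redistributed by key — that regrouping is the shuffle.
totals = defaultdict(int)
for part in filtered:
    for k, v in part:
        totals[k] += v

print(dict(totals))  # {'west': 420, 'east': 350}
```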
📦 Shuffle: The Expensive Operation
During shuffle, every executor writes data to local disk (shuffle write), then other executors fetch it over the network (shuffle read). This involves disk I/O + network I/O + serialization. Minimize shuffles: use broadcast joins, pre-partition data, avoid unnecessary groupBys.
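Mechanically, the shuffle looks roughly like this plain-Python sketch (a hypothetical layout, not Spark's actual file format): each upstream task writes one bucket per downstream partition (shuffle write), then each downstream task gathers its bucket from every upstream task (shuffle read).

```python
map_outputs = [[("west", 120), ("east", 200)],
               [("west", 300), ("east", 150)]]
NUM_REDUCERS = 2

def bucket(key):
    # Deterministic stand-in for Spark's hash partitioner.
    return sum(key.encode()) % NUM_REDUCERS

# Shuffle write: each map task produces one bucket per reducer (disk I/O).
shuffle_files = []
for rows in map_outputs:
    buckets = {r: [] for r in range(NUM_REDUCERS)}
    for k, v in rows:
        buckets[bucket(k)].append((k, v))
    shuffle_files.append(buckets)

# Shuffle read: reducer r fetches bucket r from EVERY map output (network I/O).
reducer_inputs = [
    [row for files in shuffle_files for row in files[r]]
    for r in range(NUM_REDUCERS)
]
```

Every row crosses a disk write, a network fetch, and serialization on the way — which is why avoiding the shuffle entirely beats tuning it.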
🧮 Catalyst Optimizer
Before execution, Spark's Catalyst optimizes the logical plan: predicate pushdown,
column pruning, join reordering. Then Tungsten generates optimized bytecode.
Use .explain(True) to see the parsed, analyzed, optimized, and physical plans.
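Predicate pushdown, for example, can be illustrated in plain Python (a conceptual sketch of the optimization, not Catalyst itself): pushing the filter into the scan means fewer rows ever leave the data source.

```python
source = [{"region": "west", "amount": a} for a in (50, 150, 250, 80)]

# Naive plan: scan everything, then filter — 4 rows cross the scan boundary.
scanned_naive = [row for row in source]
filtered_naive = [r for r in scanned_naive if r["amount"] > 100]

# After predicate pushdown: the filter runs inside the scan — only 2 rows emerge.
scanned_pushed = [r for r in source if r["amount"] > 100]

print(len(scanned_naive), len(scanned_pushed))  # 4 2
```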
💾 Caching & Persistence
Use .cache() or .persist() to avoid recomputing a DataFrame's whole lineage on every action.
Spark stores the data in memory only (MEMORY_ONLY) or in memory with disk spillover (MEMORY_AND_DISK).
Cache when you'll reuse a DataFrame across multiple actions — but don't over-cache: cached blocks steal memory from execution.
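The trade-off can be shown with a plain-Python analogy (not Spark's actual mechanism): without caching, every action replays the expensive lineage; with caching, it runs once and is reused.

```python
calls = {"n": 0}

def expensive_lineage():
    calls["n"] += 1              # stands in for re-reading + re-shuffling
    return [x * 2 for x in range(5)]

# Uncached: each "action" recomputes the lineage from scratch.
uncached_a, uncached_b = expensive_lineage(), expensive_lineage()

# Cached: compute once, reuse — like df.cache() before multiple actions.
cached = expensive_lineage()
cached_a, cached_b = cached, cached

print(calls["n"])  # 3 → two uncached runs + one cached run
```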
🛠️ Performance Tuning Best Practices
📏 Right-Size Partitions
spark.sql.shuffle.partitions = 200  # default
# Rule of thumb: ~128MB per partition
Too few partitions = OOM or slow tasks. Too many = scheduler overhead. Target roughly 100-200MB per partition for shuffles.
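Picking the setting is simple arithmetic — divide the expected shuffle size by the target partition size (a back-of-the-envelope sketch using the ~128 MB rule of thumb above):

```python
def shuffle_partitions(shuffle_bytes, target_bytes=128 * 1024 * 1024):
    # At least 1 partition; round up so no partition exceeds the target.
    return max(1, -(-shuffle_bytes // target_bytes))  # ceiling division

# ~50 GB of shuffle data at ~128 MB per partition:
print(shuffle_partitions(50 * 1024**3))  # 400
```

You would then apply the result with `spark.conf.set("spark.sql.shuffle.partitions", 400)` on an existing session.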
📡 Broadcast Joins
spark.sql.autoBroadcastJoinThreshold = 10MB  # default
# or force it: df.join(broadcast(small_df), "key")
If one side of a join is small (<100MB), broadcast it to every executor to avoid a shuffle. Massive speedup for dimension-table joins.
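The idea behind a broadcast join can be sketched in plain Python (conceptual, not Spark's implementation): ship the small table to every task as a lookup map, so the big table never moves across the network.

```python
# Small dimension table, "broadcast" to every task as a hash map.
regions = {"W1": "west", "E1": "east"}          # store_id -> region

# Large fact table, split across partitions; it is never shuffled.
order_partitions = [[("W1", 120), ("E1", 200)], [("W1", 300)]]

def map_side_join(partition):
    # Each task joins locally against its broadcast copy — no shuffle.
    return [(store, regions[store], amount) for store, amount in partition]

joined = [row for part in order_partitions for row in map_side_join(part)]
print(joined)
# [('W1', 'west', 120), ('E1', 'east', 200), ('W1', 'west', 300)]
```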
🔍 Data Skew
# Check: df.groupBy("key").count().orderBy(desc("count")).show()
# Fix: salting, or AQE's skew-join handling
One key with 100x more rows than the rest means one task runs 100x longer than its peers. Enable AQE (adaptive query execution) to handle skewed joins automatically.
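Salting can be sketched in plain Python (conceptual): append a random suffix to the key so the hot key's rows spread over several shuffle partitions, then aggregate twice — once per salted key, once to merge the partials.

```python
import random
from collections import defaultdict

SALT_BUCKETS = 4
rows = [("hot", 1)] * 100 + [("cold", 1)] * 3

# Stage 1: salt the key, then aggregate per (key, salt) — the hot key now
# spreads over up to SALT_BUCKETS shuffle partitions instead of one.
partial = defaultdict(int)
for key, v in rows:
    salted = (key, random.randrange(SALT_BUCKETS))
    partial[salted] += v

# Stage 2: strip the salt and merge the (at most SALT_BUCKETS) partials per key.
totals = defaultdict(int)
for (key, _salt), v in partial.items():
    totals[key] += v

print(dict(totals))  # {'hot': 100, 'cold': 3}
```

The second aggregation is tiny (a few rows per key), so the cost of the extra stage is negligible next to the skew it removes.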
⚡ AQE (Adaptive Query Execution)
spark.sql.adaptive.enabled = true
spark.sql.adaptive.coalescePartitions.enabled = true
Spark 3.0+: AQE dynamically coalesces shuffle partitions, splits skewed partitions, and switches join strategies at runtime. Always enable it (it is on by default since Spark 3.2).
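In a PySpark session these are set on the session config (a fragment assuming an existing `spark` session):

```python
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Also let AQE split oversized skewed partitions at join time:
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```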
🗑️ Avoid collect() on Large Data
# ❌ df.collect() → pulls ALL rows into driver memory
# ✅ df.write.parquet("output/")
collect() materializes the entire DataFrame in the driver's memory and can crash the driver. Use write, show, or take for large DataFrames.
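The driver-memory risk can be illustrated with a plain-Python generator (an analogy — the generator stands in for the distributed dataset): take(n) materializes only n rows on the driver, while collect() materializes all of them.

```python
from itertools import islice

def big_dataset():
    for i in range(10_000_000):   # imagine these rows living on the cluster
        yield i

# take(5): only 5 rows ever reach "driver" memory.
head = list(islice(big_dataset(), 5))
print(head)  # [0, 1, 2, 3, 4]

# collect() would be: list(big_dataset()) — all 10M rows in driver memory.
```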
📊 Spark UI: Your Best Friend
http://driver:4040/jobs/
Check the stage timeline (stragglers?), shuffle read/write sizes, GC time, and task skew. The UI tells you exactly what's slow.