⚡ DAG Execution

From Code to Cluster — How Spark Plans and Executes Your Job

When you call an action (collect, save, count), Spark builds a DAG (Directed Acyclic Graph) of transformations, splits it into stages at shuffle boundaries, and executes each stage as parallel tasks — one per partition. Understanding this pipeline is key to writing fast Spark jobs.

🔄 Execution Pipeline


Your Code
from pyspark.sql.functions import col, sum, desc

(spark.read.parquet("orders")           # Read: 4 partitions
    .filter(col("amount") > 100)        # Narrow transform
    .groupBy("region")                  # Wide transform (SHUFFLE)
    .agg(sum("amount").alias("total"))  # Aggregation
    .orderBy(desc("total"))             # Wide transform (SHUFFLE)
    .collect())                         # ACTION → triggers execution

📊 Stage Execution: Tasks in Parallel

Each stage runs tasks in parallel — one per partition. The number of executor cores determines how many tasks run at once.

Example: 4 tasks with parallelism 2 (two executor cores) run in 2 waves, so the stage takes roughly twice one task's duration.

🔀 Narrow vs Wide Transforms

Narrow (map, filter): each output partition depends on one input partition. No data movement. Wide (groupBy, join, orderBy): requires shuffle — redistributing data across the network. Shuffles are stage boundaries and the #1 performance bottleneck.

📦 Shuffle: The Expensive Operation

During shuffle, every executor writes data to local disk (shuffle write), then other executors fetch it over the network (shuffle read). This involves disk I/O + network I/O + serialization. Minimize shuffles: use broadcast joins, pre-partition data, avoid unnecessary groupBys.

🧮 Catalyst Optimizer

Before execution, Spark's Catalyst optimizes the logical plan: predicate pushdown, column pruning, join reordering. Then Tungsten generates optimized bytecode. Use .explain(true) to see all plans.

💾 Caching & Persistence

Use .cache() or .persist() to avoid recomputing a DataFrame's upstream lineage. Spark can keep data in memory only (MEMORY_ONLY) or in memory with spill to disk (MEMORY_AND_DISK). Cache when you'll reuse a DataFrame multiple times — but don't over-cache (cached blocks steal memory from execution).

🛠️ Performance Tuning Best Practices

📏 Right-Size Partitions

spark.sql.shuffle.partitions = 200 # default
# Rule: 128MB per partition

Too few = OOM or slow tasks. Too many = scheduler overhead. Target 100-200MB per partition for shuffles.
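The 128MB rule of thumb turns into a quick calculation (the 25 GB shuffle size here is an assumed figure; read the real one off the Spark UI):

```python
# Back-of-envelope partition sizing.
shuffle_bytes = 25 * 1024**3          # suppose the biggest stage shuffles ~25 GB
target_bytes = 128 * 1024**2          # aim for ~128 MB per shuffle partition
partitions = max(1, shuffle_bytes // target_bytes)

# spark.conf.set("spark.sql.shuffle.partitions", partitions)
print(partitions)  # 200
```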

📡 Broadcast Joins

spark.sql.autoBroadcastJoinThreshold = 10MB
# or: broadcast(small_df)

If one side of a join is small (<100MB), broadcast it to avoid a shuffle. Massive speedup for dimension joins.

🔍 Data Skew

# Check: df.groupBy("key").count().orderBy(desc("count"))
# Fix: salting or AQE skew join

One key with 100x more data means one task runs 100x longer than the rest. Enable AQE (Adaptive Query Execution) so Spark splits skewed partitions automatically.

⚡ AQE (Adaptive Query Execution)

spark.sql.adaptive.enabled = true
spark.sql.adaptive.coalescePartitions.enabled = true

Spark 3.0+: dynamically coalesces partitions, splits skewed ones, and switches join strategies at runtime. Always enable it (it is on by default since Spark 3.2).

🗑️ Avoid collect() on Large Data

# ❌ df.collect() → pulls ALL data to driver
# ✅ df.write.parquet("output/")

collect() brings everything to the driver's memory. Use write/show/take for large DataFrames.

📊 Spark UI: Your Best Friend

http://driver:4040/jobs/

Check: stage timeline (stragglers?), shuffle read/write size, GC time, task skew. The UI tells you exactly what's slow.