⚡ DAG Execution

From Code to Cluster — How Spark Plans and Executes Your Job

When you call an action (collect, save, count), Spark builds a DAG (Directed Acyclic Graph) of transformations, splits it into stages at shuffle boundaries, and executes each stage as parallel tasks — one per partition. Understanding this pipeline is key to writing fast Spark jobs.
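Nothing runs until that action: transformations only add nodes to the DAG, and the action is what hands the finished graph to the scheduler. A minimal sketch of this laziness, assuming an existing SparkSession named spark:

from pyspark.sql.functions import col

df = spark.range(1_000_000).withColumn("doubled", col("id") * 2)   # no work yet: just builds a DAG node
evens = df.filter(col("id") % 2 == 0)                              # still no work: another DAG node

print(evens.count())   # ACTION: Spark plans stages, runs tasks, returns 500000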

🔄 Execution Pipeline

Here is how Spark turns the code below into cluster work, step by step.

Your Code
from pyspark.sql.functions import col, sum, desc

(spark.read.parquet("orders")            # Read: 4 partitions
  .filter(col("amount") > 100)           # Narrow transform
  .groupBy("region")                     # Wide transform (SHUFFLE)
  .agg(sum("amount").alias("total"))     # Aggregation
  .orderBy(desc("total"))                # Wide transform (SHUFFLE)
  .collect())                            # ACTION → triggers execution
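To confirm where Spark will cut this job into stages, print the physical plan before calling the action; each Exchange operator in the output corresponds to one of the shuffles marked above. A quick sketch, assuming the same orders dataset and SparkSession (exact plan text varies by Spark version and AQE settings):

from pyspark.sql.functions import col, sum, desc

plan = (spark.read.parquet("orders")
        .filter(col("amount") > 100)
        .groupBy("region")
        .agg(sum("amount").alias("total"))
        .orderBy(desc("total")))

plan.explain()   # expect two Exchange operators: one for the groupBy, one for the orderBy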

📊 Stage Execution: Tasks in Parallel

Each stage runs its tasks in parallel, one per partition. The number of available executor cores sets how many tasks run at once; when there are more tasks than cores, the tasks run in successive waves.

Tasks: 4
Parallelism: 2× (two tasks run at once)
Waves: 2
Est. time: ~2 task durations
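The wave count above is just the number of tasks divided by the number of parallel task slots, rounded up, where a slot is one executor core. A small sketch of that arithmetic (the function and parameter names are illustrative, not a Spark API):

import math

def estimate_waves(num_tasks: int, num_executors: int, cores_per_executor: int) -> int:
    slots = num_executors * cores_per_executor   # tasks that can run at the same time
    return math.ceil(num_tasks / slots)

print(estimate_waves(num_tasks=4, num_executors=1, cores_per_executor=2))   # 2 waves, as above
print(estimate_waves(num_tasks=4, num_executors=2, cores_per_executor=2))   # 1 wave: more executors, fewer waves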

🔀 Narrow vs Wide Transforms

Narrow (map, filter): each output partition depends on a single input partition, so no data moves between executors and Spark can pipeline these within one stage. Wide (groupBy, join, orderBy): requires a shuffle, redistributing rows across the network by key, which creates a stage boundary.
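A quick way to see the difference is to compare physical plans: a narrow transform leaves the plan free of Exchange operators, while a wide one inserts an Exchange (the shuffle). A minimal sketch, assuming an existing SparkSession named spark:

from pyspark.sql.functions import col

df = spark.range(100).withColumn("bucket", col("id") % 4)

df.filter(col("id") > 10).explain()       # narrow: no Exchange in the plan
df.groupBy("bucket").count().explain()    # wide: an Exchange (hash partitioning on bucket) appears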