⚡ DAG Execution
From Code to Cluster — How Spark Plans and Executes Your Job
When you call an action (collect, save, count), Spark builds a DAG (Directed Acyclic Graph) of transformations, splits it into stages at shuffle boundaries, and executes each stage as parallel tasks — one per partition. Understanding this pipeline is key to writing fast Spark jobs.
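Transformations are lazy: each one only extends the plan, and no cluster work happens until an action runs. A minimal sketch of that split, assuming an existing SparkSession named spark and a hypothetical "orders" Parquet dataset:

from pyspark.sql.functions import col

df = spark.read.parquet("orders")            # defines the source; no data is read yet
filtered = df.filter(col("amount") > 100)    # still just a plan
rows = filtered.collect()                    # ACTION: Spark builds the DAG, splits it into stages, runs the tasks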
🔄 Execution Pipeline
Click through to see how Spark transforms your code into cluster work
Your Code
spark.read.parquet("orders") # Read: 4 partitions
.filter(col("amount") > 100) # Narrow transform
.groupBy("region") # Wide transform (SHUFFLE)
.agg(sum("amount").alias("total")) # Aggregation
.orderBy(desc("total")) # Wide transform (SHUFFLE)
.collect() # ACTION → triggers execution Ready Click "Next Step" to begin
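You can check where Spark will place the shuffle boundaries before triggering the action: in the physical plan, each Exchange operator marks a shuffle, and therefore a stage split. A minimal sketch of inspecting the plan for the query above, assuming an existing SparkSession named spark:

from pyspark.sql.functions import col, sum, desc

query = (spark.read.parquet("orders")
         .filter(col("amount") > 100)
         .groupBy("region")
         .agg(sum("amount").alias("total"))
         .orderBy(desc("total")))

query.explain()   # runs no job; look for two Exchange operators:
                  # hashpartitioning(region, ...) for the groupBy and
                  # rangepartitioning(total DESC, ...) for the orderBy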
📊 Stage Execution: Tasks in Parallel
Each stage runs tasks in parallel — one per partition. Adjust executors to see the effect.
Example: 4 tasks with 2 task slots (parallelism 2×) run in 2 waves, so estimated stage time ≈ 2 waves.
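The wave count is just tasks divided by available task slots. A rough sketch, assuming an existing SparkSession spark, the hypothetical "orders" dataset, and an assumed 2 task slots (for example, 2 executors with 1 core each):

import math

num_tasks = spark.read.parquet("orders").rdd.getNumPartitions()  # one task per partition
task_slots = 2                                                    # assumption: executors x cores per executor
waves = math.ceil(num_tasks / task_slots)
print(f"{num_tasks} tasks / {task_slots} slots -> {waves} wave(s)")

Fewer waves generally means better utilization, so partition counts that are a small multiple of the slot count tend to schedule cleanly.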
🔀 Narrow vs Wide Transforms
Narrow (map, filter): each output partition depends on a single input partition, so no data moves between executors. Wide (groupBy, join, orderBy): requires a shuffle that redistributes data across the network; each shuffle boundary starts a new stage.
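The difference is visible directly in the physical plan: narrow transforms stay inside a stage, while wide transforms insert an Exchange. A minimal sketch, assuming the same spark session and hypothetical "orders" dataset:

from pyspark.sql.functions import col

df = spark.read.parquet("orders")

df.filter(col("amount") > 100).explain()   # narrow: no Exchange operator, no shuffle
df.groupBy("region").count().explain()     # wide: Exchange hashpartitioning(region, ...), i.e. a shuffle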