Design a Distributed Job Scheduler

Priority Queues, Worker Pools, Exactly-Once Execution, Cron Scheduling, and Failure Recovery

A distributed job scheduler orchestrates the execution of tasks across a fleet of workers. It handles priority-based scheduling (high-priority jobs run first), delayed and cron jobs (execute at a future time or on a recurring schedule), exactly-once execution (via fencing tokens and idempotency keys), and failure recovery (retries with exponential backoff, dead letter queues). At scale, systems like Airflow, Temporal, and Celery dispatch millions of jobs per day with sub-second dispatch latency and automatically scaled worker pools.

Job Queue Visualization

A priority queue ensures high-priority jobs are processed first. The delay queue holds jobs scheduled for future execution -- when their time arrives, they move to the ready queue automatically.

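
The two-queue mechanics can be sketched in a few lines of Python. The class name, priority scheme, and in-memory heaps below are illustrative assumptions, not any particular system's API:

```python
import heapq
import time

class JobScheduler:
    """Sketch: a priority ready-queue plus a delay queue. Jobs in the
    delay queue move to the ready queue once their scheduled time
    arrives; ready jobs pop lowest priority number first."""

    def __init__(self):
        self._ready = []    # heap of (priority, seq, job)
        self._delayed = []  # heap of (run_at, seq, priority, job)
        self._seq = 0       # tie-breaker so equal priorities stay FIFO

    def enqueue(self, job, priority=10, delay=0.0):
        self._seq += 1
        if delay > 0:
            heapq.heappush(self._delayed,
                           (time.time() + delay, self._seq, priority, job))
        else:
            heapq.heappush(self._ready, (priority, self._seq, job))

    def _promote_due(self):
        # Move every delayed job whose timer has fired into the ready queue.
        now = time.time()
        while self._delayed and self._delayed[0][0] <= now:
            _, seq, priority, job = heapq.heappop(self._delayed)
            heapq.heappush(self._ready, (priority, seq, job))

    def pop(self):
        """Return the highest-priority ready job, or None if nothing is due."""
        self._promote_due()
        if self._ready:
            return heapq.heappop(self._ready)[2]
        return None
```

A production queue would persist these structures (e.g. Redis sorted sets keyed by priority and by execution timestamp) rather than keep them in process memory.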

Worker Pool Scaling Simulator

Adjust the number of workers, job arrival rate, and average job duration to see how queue depth, utilization, and throughput change in real time.

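
The simulator's steady-state numbers follow from basic queueing arithmetic. A minimal sketch, assuming roughly Poisson arrivals (the function name and return keys are ours):

```python
def pool_metrics(arrival_rate, avg_duration, workers):
    """Steady-state estimates for a worker pool.

    arrival_rate: jobs/sec arriving
    avg_duration: seconds of work per job
    workers:      number of concurrent workers
    """
    capacity = workers / avg_duration           # max jobs/sec the pool can finish
    utilization = min(1.0, arrival_rate / capacity)
    throughput = min(arrival_rate, capacity)    # output can't exceed capacity
    stable = arrival_rate < capacity            # otherwise queue depth grows forever
    return {"utilization": utilization, "throughput": throughput, "stable": stable}
```

The key intuition: once arrival rate exceeds `workers / avg_duration`, utilization pins at 100% and queue depth (and so average wait) grows without bound until you add workers.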

Exactly-Once Execution

True exactly-once delivery is impossible in a distributed system; in practice it is approximated by combining at-least-once delivery with fencing tokens and idempotency keys. Each worker acquires a lease with a monotonically increasing fencing token before executing a job, and downstream stores reject writes that carry a stale token.

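
A minimal in-memory sketch of fencing, assuming a single coordination point that issues tokens (`LeaseStore` and `JobResultStore` are illustrative names, not a real library):

```python
class LeaseStore:
    """Hands out leases with monotonically increasing fencing tokens
    (in production: ZooKeeper, etcd, or a DB sequence)."""

    def __init__(self):
        self._token = 0

    def acquire(self, job_id):
        self._token += 1
        return self._token


class JobResultStore:
    """The protected resource: rejects writes carrying a token older
    than one it has already seen, so a paused-then-resumed worker
    cannot clobber the work of a newer lease holder."""

    def __init__(self):
        self._highest_seen = {}
        self.results = {}

    def write(self, job_id, token, result):
        if token < self._highest_seen.get(job_id, 0):
            return False  # stale token: fence this writer out
        self._highest_seen[job_id] = token
        self.results[job_id] = result
        return True
```

The classic failure this prevents: worker A acquires a lease, stalls in a GC pause, its lease expires, worker B takes over, and then A wakes up and tries to write with its old token.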

Cron Expression Parser

Parse cron expressions to understand recurring job schedules. Enter a cron expression or use the presets to see the next execution times.


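A toy parser for the common cron subset (`*`, `*/n`, comma lists) shows how next execution times can be computed. It brute-forces minute by minute and uses Python's weekday numbering (0 = Monday) rather than cron's (0 = Sunday), so treat it as a demo, not a spec-complete implementation:

```python
from datetime import datetime, timedelta

def _field_matches(field, value):
    """Match one cron field: supports '*', '*/n', and comma lists."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0
    return value in {int(p) for p in field.split(",")}

def next_runs(expr, start, count=3):
    """Next `count` fire times for a 5-field cron expression
    (minute hour day-of-month month day-of-week) after `start`."""
    minute, hour, dom, month, dow = expr.split()
    t = start.replace(second=0, microsecond=0)
    runs = []
    while len(runs) < count:
        t += timedelta(minutes=1)       # brute-force stepping: demo only
        if (_field_matches(minute, t.minute) and _field_matches(hour, t.hour)
                and _field_matches(dom, t.day) and _field_matches(month, t.month)
                and _field_matches(dow, t.weekday())):
            runs.append(t)
    return runs
```

Real parsers (e.g. the one behind crontab) also handle ranges (`1-5`), names (`MON`), and the subtle OR semantics between day-of-month and day-of-week.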
Failure Handling & Retry Flow

When a job fails, it is retried with exponential backoff. After exhausting retries, it moves to the Dead Letter Queue (DLQ) for manual inspection.

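
The retry-then-DLQ flow, sketched with an in-memory DLQ; the actual sleep is elided so the example runs instantly, and all names are illustrative:

```python
import random

def run_with_retries(job, max_retries=3, base_delay=1.0, dlq=None):
    """Run `job` (a zero-arg callable). On failure, retry with
    exponential backoff plus jitter; once retries are exhausted,
    park the job in the dead letter queue for manual inspection."""
    for attempt in range(max_retries + 1):
        try:
            return job()
        except Exception as exc:
            if attempt == max_retries:
                if dlq is not None:
                    dlq.append((job, str(exc)))  # for inspection / replay
                return None
            # base * 2^attempt, plus jitter to avoid thundering-herd retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            # time.sleep(delay)  # omitted in this sketch
```

With `base_delay=1.0`, the waits grow roughly 1 s, 2 s, 4 s, ... — the jitter spreads retries out so a mass failure doesn't hammer a recovering dependency in lockstep.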

Architecture

Client / Cron Trigger
→ API Gateway
→ Job Scheduler
→ Job Queue (Redis / Kafka)
→ Worker Pool (pulls & executes jobs)
→ Job DB (PostgreSQL -- job metadata)
→ Result Store (S3 / DB -- output & logs)

Side components:
  • Dead Letter Queue -- failed jobs after max retries
  • Monitoring -- Prometheus + Grafana, alerts on failures

Key Design Decisions

Push vs Pull Model

Push (Scheduler assigns)
  • Scheduler picks worker + pushes job
  • Better load balancing
  • Scheduler is single point of failure
  • Used by Kubernetes Jobs
vs
Pull (Workers poll)
  • Workers pull from queue when ready
  • Natural backpressure
  • Workers self-regulate capacity
  • Used by Celery, Sidekiq
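
The pull model's natural backpressure is easy to demonstrate with threads draining a shared queue: a slow worker simply pulls less often instead of being overloaded by a pushing scheduler. A sketch, not how Celery or Sidekiq are actually implemented:

```python
import queue
import threading

def pull_worker(jobs: "queue.Queue", results: list, stop: threading.Event):
    """Pull-model worker: takes the next job only when it has free
    capacity, then asks for more work."""
    while not stop.is_set():
        try:
            job = jobs.get(timeout=0.1)  # block briefly, then re-check stop flag
        except queue.Empty:
            continue
        results.append(job())            # execute the job
        jobs.task_done()
```

Note there is no assignment step anywhere: capacity regulation falls out of each worker pulling at its own pace, which is exactly what a push scheduler has to reimplement with health checks and load estimates.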

At-Least-Once vs Exactly-Once

At-Least-Once
  • Simpler: retry on failure
  • Jobs must be idempotent
  • Duplicates possible
  • Most systems use this
vs
Exactly-Once
  • Fencing tokens + idempotency keys
  • Higher complexity & latency
  • Needed for payments, billing
  • Requires distributed coordination
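
The usual middle ground -- at-least-once delivery plus idempotent handlers -- can be sketched with an idempotency-key cache. In production the dict below would be a database table with a unique constraint, and the class name is our invention:

```python
class IdempotentExecutor:
    """At-least-once delivery + an idempotency-key check gives
    effectively-once side effects: a redelivered job is detected by
    its key and the cached result is returned instead of re-running."""

    def __init__(self):
        self._done = {}  # idempotency_key -> result

    def execute(self, idempotency_key, fn):
        if idempotency_key in self._done:
            return self._done[idempotency_key]  # duplicate delivery: skip
        result = fn()
        self._done[idempotency_key] = result
        return result
```

This is why "jobs must be idempotent" appears in the at-least-once column: the queue is free to redeliver, and the handler absorbs the duplicates.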

Queue Backend: Redis vs Kafka

Redis (sorted sets for priority, pub/sub for notifications): low latency (~1ms), simple, but limited durability. Best for <100K jobs/sec. Kafka (partitioned topics, consumer groups): high throughput, durable, exactly-once via transactions. Best for >100K jobs/sec with strict ordering guarantees. Hybrid approach: Redis for priority queue + Kafka for event log.

Cron Scheduling at Scale

Store cron schedules in DB. A scheduler service polls every second for due jobs. Use leader election (via ZooKeeper/etcd) to ensure only one scheduler instance fires each cron job. Pre-compute next N execution times and insert into a delay queue sorted by execution time. Partition cron jobs by hash to distribute load across scheduler instances.
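
Hash partitioning of cron jobs across scheduler instances can be as simple as a stable hash modulo the instance count (a sketch; real systems also rebalance ownership when instances join or leave, often via consistent hashing):

```python
import hashlib

def owner_of(cron_job_id: str, num_schedulers: int) -> int:
    """Stable hash partitioning: each cron job is owned by exactly one
    scheduler instance, so no job fires twice even with N schedulers.
    Uses SHA-256 rather than hash() so the assignment is identical
    across processes and restarts."""
    digest = hashlib.sha256(cron_job_id.encode()).hexdigest()
    return int(digest, 16) % num_schedulers
```

Each scheduler instance then polls only the rows where `owner_of(job_id, N) == my_index`, and leader election is needed only to agree on the instance list itself.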

Capacity Estimation

Estimate the infrastructure needed for your job scheduling system.

Key outputs: average jobs/sec, workers needed at peak, queue storage, and DB entries per day.
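
A worked back-of-the-envelope version of this estimate, with assumed inputs (10M jobs/day, 3x peak factor, 2 s average job duration, 1 KB per queue entry -- all numbers you would tune for your own workload):

```python
def capacity_estimate(jobs_per_day, peak_factor, avg_duration_s, bytes_per_job):
    """Back-of-the-envelope sizing; every input is an assumption."""
    avg_rate = jobs_per_day / 86_400                    # jobs/sec, averaged
    peak_rate = avg_rate * peak_factor
    workers_peak = int(peak_rate * avg_duration_s) + 1  # Little's law: L = lambda * W
    queue_storage = jobs_per_day * bytes_per_job        # one day of retained entries
    return {
        "avg_jobs_per_sec": round(avg_rate, 1),
        "peak_workers": workers_peak,
        "queue_storage_gb": round(queue_storage / 1e9, 2),
        "db_entries_per_day": jobs_per_day,
    }
```

With the assumed inputs: ~116 jobs/sec on average, ~347 jobs/sec at peak, so roughly 695 busy workers at 2 s per job, and about 10 GB/day of 1 KB queue entries.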

A critical infrastructure component -- understand priority scheduling, exactly-once semantics, and failure recovery patterns.