Design a Distributed Job Scheduler

Priority Queues, Worker Pools, Exactly-Once Execution, Cron Scheduling, and Failure Recovery

A distributed job scheduler orchestrates the execution of tasks across a fleet of workers. It handles priority-based scheduling (high-priority jobs run first), delayed and cron jobs (execute at a future time or on a recurring schedule), exactly-once execution (via fencing tokens and idempotency keys), and failure recovery (retries with exponential backoff, dead letter queues). At scale, systems like Airflow, Temporal, and Celery dispatch millions of jobs per day with sub-second dispatch latency and automatically scaled worker pools.

Job Queue Visualization

A priority queue ensures high-priority jobs are processed first. The delay queue holds jobs scheduled for future execution -- when their time arrives, they move to the ready queue automatically.

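
The two-queue mechanics can be sketched in a few lines of Python. The class name, priority scheme, and in-memory heaps below are illustrative assumptions, not any particular system's API:

```python
import heapq
import time

class JobScheduler:
    """Sketch: a priority ready-queue plus a delay queue. Jobs in the
    delay queue move to the ready queue once their scheduled time
    arrives; ready jobs pop lowest priority number first."""

    def __init__(self):
        self._ready = []    # heap of (priority, seq, job)
        self._delayed = []  # heap of (run_at, seq, priority, job)
        self._seq = 0       # tie-breaker so equal priorities stay FIFO

    def enqueue(self, job, priority=10, delay=0.0):
        self._seq += 1
        if delay > 0:
            heapq.heappush(self._delayed,
                           (time.time() + delay, self._seq, priority, job))
        else:
            heapq.heappush(self._ready, (priority, self._seq, job))

    def _promote_due(self):
        # Move every delayed job whose timer has fired into the ready queue.
        now = time.time()
        while self._delayed and self._delayed[0][0] <= now:
            _, seq, priority, job = heapq.heappop(self._delayed)
            heapq.heappush(self._ready, (priority, seq, job))

    def pop(self):
        """Return the highest-priority ready job, or None if nothing is due."""
        self._promote_due()
        if self._ready:
            return heapq.heappop(self._ready)[2]
        return None
```

A production queue would persist these structures (e.g. Redis sorted sets keyed by priority and by execution timestamp) rather than keep them in process memory.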

Worker Pool Scaling Simulator

Adjust the number of workers, job arrival rate, and average job duration to see how queue depth, utilization, and throughput change in real time.

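
The simulator's steady-state numbers follow from basic queueing arithmetic. A minimal sketch, assuming roughly Poisson arrivals (the function name and return keys are ours):

```python
def pool_metrics(arrival_rate, avg_duration, workers):
    """Steady-state estimates for a worker pool.

    arrival_rate: jobs/sec arriving
    avg_duration: seconds of work per job
    workers:      number of concurrent workers
    """
    capacity = workers / avg_duration           # max jobs/sec the pool can finish
    utilization = min(1.0, arrival_rate / capacity)
    throughput = min(arrival_rate, capacity)    # output can't exceed capacity
    stable = arrival_rate < capacity            # otherwise queue depth grows forever
    return {"utilization": utilization, "throughput": throughput, "stable": stable}
```

The key intuition: once arrival rate exceeds `workers / avg_duration`, utilization pins at 100% and queue depth (and so average wait) grows without bound until you add workers.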

Exactly-Once Execution

True exactly-once delivery is impossible in a distributed system; in practice it is approximated by combining at-least-once delivery with fencing tokens and idempotency keys. Each worker acquires a lease with a monotonically increasing fencing token before executing a job, and downstream stores reject writes that carry a stale token.

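
A minimal in-memory sketch of fencing, assuming a single coordination point that issues tokens (`LeaseStore` and `JobResultStore` are illustrative names, not a real library):

```python
class LeaseStore:
    """Hands out leases with monotonically increasing fencing tokens
    (in production: ZooKeeper, etcd, or a DB sequence)."""

    def __init__(self):
        self._token = 0

    def acquire(self, job_id):
        self._token += 1
        return self._token


class JobResultStore:
    """The protected resource: rejects writes carrying a token older
    than one it has already seen, so a paused-then-resumed worker
    cannot clobber the work of a newer lease holder."""

    def __init__(self):
        self._highest_seen = {}
        self.results = {}

    def write(self, job_id, token, result):
        if token < self._highest_seen.get(job_id, 0):
            return False  # stale token: fence this writer out
        self._highest_seen[job_id] = token
        self.results[job_id] = result
        return True
```

The classic failure this prevents: worker A acquires a lease, stalls in a GC pause, its lease expires, worker B takes over, and then A wakes up and tries to write with its old token.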

Cron Expression Parser

Parse cron expressions to understand recurring job schedules. Enter a cron expression or use the presets to see the next execution times.


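A toy parser for the common cron subset (`*`, `*/n`, comma lists) shows how next execution times can be computed. It brute-forces minute by minute and uses Python's weekday numbering (0 = Monday) rather than cron's (0 = Sunday), so treat it as a demo, not a spec-complete implementation:

```python
from datetime import datetime, timedelta

def _field_matches(field, value):
    """Match one cron field: supports '*', '*/n', and comma lists."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0
    return value in {int(p) for p in field.split(",")}

def next_runs(expr, start, count=3):
    """Next `count` fire times for a 5-field cron expression
    (minute hour day-of-month month day-of-week) after `start`."""
    minute, hour, dom, month, dow = expr.split()
    t = start.replace(second=0, microsecond=0)
    runs = []
    while len(runs) < count:
        t += timedelta(minutes=1)       # brute-force stepping: demo only
        if (_field_matches(minute, t.minute) and _field_matches(hour, t.hour)
                and _field_matches(dom, t.day) and _field_matches(month, t.month)
                and _field_matches(dow, t.weekday())):
            runs.append(t)
    return runs
```

Real parsers (e.g. the one behind crontab) also handle ranges (`1-5`), names (`MON`), and the subtle OR semantics between day-of-month and day-of-week.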
Failure Handling & Retry Flow

When a job fails, it is retried with exponential backoff. After exhausting retries, it moves to the Dead Letter Queue (DLQ) for manual inspection.

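
The retry-then-DLQ flow, sketched with an in-memory DLQ; the actual sleep is elided so the example runs instantly, and all names are illustrative:

```python
import random

def run_with_retries(job, max_retries=3, base_delay=1.0, dlq=None):
    """Run `job` (a zero-arg callable). On failure, retry with
    exponential backoff plus jitter; once retries are exhausted,
    park the job in the dead letter queue for manual inspection."""
    for attempt in range(max_retries + 1):
        try:
            return job()
        except Exception as exc:
            if attempt == max_retries:
                if dlq is not None:
                    dlq.append((job, str(exc)))  # for inspection / replay
                return None
            # base * 2^attempt, plus jitter to avoid thundering-herd retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            # time.sleep(delay)  # omitted in this sketch
```

With `base_delay=1.0`, the waits grow roughly 1 s, 2 s, 4 s, ... — the jitter spreads retries out so a mass failure doesn't hammer a recovering dependency in lockstep.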

Architecture

Client / Cron Trigger
→ API Gateway
→ Job Scheduler
→ Job Queue (Redis / Kafka)
→ Worker Pool (pulls & executes jobs)
→ Job DB (PostgreSQL -- job metadata)
→ Result Store (S3 / DB -- output & logs)

Side components:
  • Dead Letter Queue -- failed jobs after max retries
  • Monitoring -- Prometheus + Grafana, alerts on failures

Key Design Decisions

Push vs Pull Model

Push (Scheduler assigns)
  • Scheduler picks worker + pushes job
  • Better load balancing
  • Scheduler is single point of failure
  • Used by Kubernetes Jobs
vs
Pull (Workers poll)
  • Workers pull from queue when ready
  • Natural backpressure
  • Workers self-regulate capacity
  • Used by Celery, Sidekiq
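
The pull model's natural backpressure is easy to demonstrate with threads draining a shared queue: a slow worker simply pulls less often instead of being overloaded by a pushing scheduler. A sketch, not how Celery or Sidekiq are actually implemented:

```python
import queue
import threading

def pull_worker(jobs: "queue.Queue", results: list, stop: threading.Event):
    """Pull-model worker: takes the next job only when it has free
    capacity, then asks for more work."""
    while not stop.is_set():
        try:
            job = jobs.get(timeout=0.1)  # block briefly, then re-check stop flag
        except queue.Empty:
            continue
        results.append(job())            # execute the job
        jobs.task_done()
```

Note there is no assignment step anywhere: capacity regulation falls out of each worker pulling at its own pace, which is exactly what a push scheduler has to reimplement with health checks and load estimates.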

At-Least-Once vs Exactly-Once

At-Least-Once
  • Simpler: retry on failure
  • Jobs must be idempotent
  • Duplicates possible
  • Most systems use this
vs
Exactly-Once
  • Fencing tokens + idempotency keys
  • Higher complexity & latency
  • Needed for payments, billing
  • Requires distributed coordination
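
The usual middle ground -- at-least-once delivery plus idempotent handlers -- can be sketched with an idempotency-key cache. In production the dict below would be a database table with a unique constraint, and the class name is our invention:

```python
class IdempotentExecutor:
    """At-least-once delivery + an idempotency-key check gives
    effectively-once side effects: a redelivered job is detected by
    its key and the cached result is returned instead of re-running."""

    def __init__(self):
        self._done = {}  # idempotency_key -> result

    def execute(self, idempotency_key, fn):
        if idempotency_key in self._done:
            return self._done[idempotency_key]  # duplicate delivery: skip
        result = fn()
        self._done[idempotency_key] = result
        return result
```

This is why "jobs must be idempotent" appears in the at-least-once column: the queue is free to redeliver, and the handler absorbs the duplicates.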

Queue Backend: Redis vs Kafka

Redis (sorted sets for priority, pub/sub for notifications): low latency (~1ms), simple, but limited durability. Best for <100K jobs/sec. Kafka (partitioned topics, consumer groups): high throughput, durable, exactly-once via transactions. Best for >100K jobs/sec with strict ordering guarantees. Hybrid approach: Redis for priority queue + Kafka for event log.

Cron Scheduling at Scale

Store cron schedules in DB. A scheduler service polls every second for due jobs. Use leader election (via ZooKeeper/etcd) to ensure only one scheduler instance fires each cron job. Pre-compute next N execution times and insert into a delay queue sorted by execution time. Partition cron jobs by hash to distribute load across scheduler instances.
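
Hash partitioning of cron jobs across scheduler instances can be as simple as a stable hash modulo the instance count (a sketch; real systems also rebalance ownership when instances join or leave, often via consistent hashing):

```python
import hashlib

def owner_of(cron_job_id: str, num_schedulers: int) -> int:
    """Stable hash partitioning: each cron job is owned by exactly one
    scheduler instance, so no job fires twice even with N schedulers.
    Uses SHA-256 rather than hash() so the assignment is identical
    across processes and restarts."""
    digest = hashlib.sha256(cron_job_id.encode()).hexdigest()
    return int(digest, 16) % num_schedulers
```

Each scheduler instance then polls only the rows where `owner_of(job_id, N) == my_index`, and leader election is needed only to agree on the instance list itself.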

Capacity Estimation

Estimate the infrastructure needed for your job scheduling system.

Key outputs: average jobs/sec, workers needed at peak, queue storage, and DB entries per day.
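
A worked back-of-the-envelope version of this estimate, with assumed inputs (10M jobs/day, 3x peak factor, 2 s average job duration, 1 KB per queue entry -- all numbers you would tune for your own workload):

```python
def capacity_estimate(jobs_per_day, peak_factor, avg_duration_s, bytes_per_job):
    """Back-of-the-envelope sizing; every input is an assumption."""
    avg_rate = jobs_per_day / 86_400                    # jobs/sec, averaged
    peak_rate = avg_rate * peak_factor
    workers_peak = int(peak_rate * avg_duration_s) + 1  # Little's law: L = lambda * W
    queue_storage = jobs_per_day * bytes_per_job        # one day of retained entries
    return {
        "avg_jobs_per_sec": round(avg_rate, 1),
        "peak_workers": workers_peak,
        "queue_storage_gb": round(queue_storage / 1e9, 2),
        "db_entries_per_day": jobs_per_day,
    }
```

With the assumed inputs: ~116 jobs/sec on average, ~347 jobs/sec at peak, so roughly 695 busy workers at 2 s per job, and about 10 GB/day of 1 KB queue entries.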

A critical infrastructure component -- understand priority scheduling, exactly-once semantics, and failure recovery patterns.