Design a Distributed Job Scheduler
Priority Queues, Worker Pools, Exactly-Once Execution, Cron Scheduling, and Failure Recovery
A distributed job scheduler orchestrates the execution of tasks across a fleet of workers. It handles priority-based scheduling (high-priority jobs run first), delayed/cron jobs (execute at a future time or on a recurring schedule), exactly-once execution (via fencing tokens and idempotency keys), and failure recovery (retries with exponential backoff, dead letter queues). At scale, systems like Airflow, Temporal, and Celery schedule millions of jobs per day with sub-second dispatch latency, strong consistency guarantees, and automatic worker pool scaling.
Job Queue Visualization
A priority queue ensures high-priority jobs are processed first. The delay queue holds jobs scheduled for future execution -- when their time arrives, they move to the ready queue automatically.
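The two queues described above can be sketched with Python's `heapq`: a minimal in-memory illustration (not any particular system's implementation), where lower priority numbers run first and a `promote_due` step moves delayed jobs whose time has arrived into the ready queue.

```python
import heapq
import itertools
import time

class JobQueues:
    """Ready queue ordered by priority; delay queue ordered by run-at time."""
    def __init__(self):
        self._seq = itertools.count()  # tie-breaker for stable FIFO ordering
        self.ready = []    # heap of (priority, seq, job); lower number = higher priority
        self.delayed = []  # heap of (run_at, seq, priority, job)

    def submit(self, job, priority=10, delay=0.0, now=None):
        now = time.time() if now is None else now
        if delay > 0:
            heapq.heappush(self.delayed, (now + delay, next(self._seq), priority, job))
        else:
            heapq.heappush(self.ready, (priority, next(self._seq), job))

    def promote_due(self, now=None):
        """Move delayed jobs whose scheduled time has arrived into the ready queue."""
        now = time.time() if now is None else now
        while self.delayed and self.delayed[0][0] <= now:
            _, seq, priority, job = heapq.heappop(self.delayed)
            heapq.heappush(self.ready, (priority, seq, job))

    def pop(self):
        """Return the highest-priority ready job, or None if the queue is empty."""
        return heapq.heappop(self.ready)[2] if self.ready else None
```

A scheduler loop would call `promote_due` on each tick before dispatching from the ready queue.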
Worker Pool Scaling Simulator
Adjust the number of workers, job arrival rate, and average job duration to see how queue depth, utilization, and throughput change in real time.
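The steady-state relationships behind such a simulator follow from basic queueing arithmetic (Little's law). A hedged sketch, assuming roughly random arrivals; `pool_metrics` is a hypothetical helper, not part of any real system:

```python
def pool_metrics(workers, arrival_rate, avg_duration):
    """Steady-state estimates for a worker pool.

    arrival_rate: jobs/sec, avg_duration: sec/job.
    offered_load (in Erlangs) is the amount of work arriving per second;
    the pool is stable only when offered_load < workers.
    """
    offered_load = arrival_rate * avg_duration
    utilization = offered_load / workers
    stable = utilization < 1.0
    # When overloaded, throughput is capped by worker capacity and the queue grows.
    throughput = arrival_rate if stable else workers / avg_duration
    return {"utilization": utilization, "stable": stable, "throughput": throughput}
```

For example, 50 jobs/sec at 100 ms each against 10 workers gives 50% utilization; drop to 2 workers and the queue depth grows without bound.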
Exactly-Once Execution
To guarantee exactly-once semantics, distributed job execution relies on fencing tokens and idempotency keys. Each worker acquires a lease with a monotonically increasing fencing token before executing a job; any write that carries a token older than the latest one issued is rejected, which fences off workers that paused and resumed after their lease expired.
Cron Expression Parser
Parse cron expressions to understand recurring job schedules. Enter a cron expression or use the presets to see the next execution times.
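A minimal five-field cron evaluator can be sketched in a few dozen lines. This is a simplified illustration, not a full cron implementation: it supports `*`, steps, ranges, and lists, but it requires day-of-month and day-of-week to both match (real cron ORs them when both are restricted), and it maps day-of-week to cron's Sunday=0 numbering.

```python
from datetime import datetime, timedelta

def _expand(field, lo, hi):
    """Expand one cron field ('*', '*/5', '1,15', '10-20') into a set of ints."""
    values = set()
    for part in field.split(","):
        step = 1
        if "/" in part:
            part, step = part.split("/")
            step = int(step)
        if part == "*":
            start, end = lo, hi
        elif "-" in part:
            start, end = map(int, part.split("-"))
        else:
            start = end = int(part)
        values.update(range(start, end + 1, step))
    return values

def next_runs(expr, after, count=3):
    """Next `count` fire times for 'minute hour day-of-month month day-of-week'."""
    minute, hour, dom, month, dow = expr.split()
    m = _expand(minute, 0, 59)
    h = _expand(hour, 0, 23)
    d = _expand(dom, 1, 31)
    mo = _expand(month, 1, 12)
    wd = _expand(dow, 0, 6)  # cron numbering: Sunday = 0
    t = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    out = []
    while len(out) < count:
        if (t.minute in m and t.hour in h and t.day in d
                and t.month in mo and (t.weekday() + 1) % 7 in wd):
            out.append(t)
        t += timedelta(minutes=1)
    return out
```

So `*/15 * * * *` starting at midnight fires at 00:15, 00:30, 00:45, and `0 9 * * *` evaluated after 10:00 fires at 09:00 the next day.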
Failure Handling & Retry Flow
When a job fails, it is retried with exponential backoff. After exhausting retries, it moves to the Dead Letter Queue (DLQ) for manual inspection.
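The retry flow above can be sketched as a small policy function. This is a testable illustration, assuming injected `execute`, `dlq`, and `sleep` hooks (hypothetical names, not from any specific library); the backoff uses capped exponential delay with full jitter:

```python
import random

def run_with_retries(job, execute, max_attempts=5, base=1.0, cap=60.0,
                     dlq=None, sleep=None):
    """Retry a failing job with exponential backoff; exhausted jobs go to the DLQ."""
    sleep = sleep or (lambda seconds: None)
    for attempt in range(max_attempts):
        try:
            return execute(job)
        except Exception as exc:
            if attempt == max_attempts - 1:
                if dlq is not None:
                    dlq.append((job, str(exc)))  # park for manual inspection
                return None
            # Full jitter: wait a random amount up to the capped exponential delay,
            # which spreads out retry storms from correlated failures.
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Jitter matters in practice: without it, all jobs that failed together retry together and can re-overload the downstream dependency.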
Architecture
[Architecture diagram: scheduler and workers connected through a queue backend (Redis / Kafka), with PostgreSQL for job state, S3 / DB for results, and monitoring that alerts on failures]
Key Design Decisions
Push vs Pull Model
Push model:
- Scheduler picks a worker and pushes the job to it
- Better load balancing
- Scheduler is a single point of failure
- Used by Kubernetes Jobs

Pull model:
- Workers pull from the queue when ready
- Natural backpressure
- Workers self-regulate capacity
- Used by Celery and Sidekiq
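The pull model's backpressure property falls out of the worker loop itself: a worker only takes a new job after finishing the previous one. A minimal sketch using Python's standard-library queue (illustrative, not how Celery or Sidekiq are implemented):

```python
import queue
import threading

def worker_loop(job_queue, handle, stop):
    """Pull-model worker: blocks on the queue and self-regulates capacity.

    The worker never receives more work than it can process, because it only
    asks for the next job after `handle` returns (natural backpressure).
    """
    while not stop.is_set():
        try:
            job = job_queue.get(timeout=0.1)  # short timeout so stop is noticed
        except queue.Empty:
            continue
        handle(job)
        job_queue.task_done()
```

Scaling out is then just starting more worker loops against the same queue; no central component needs to know each worker's capacity.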
At-Least-Once vs Exactly-Once
At-least-once:
- Simpler: retry on failure
- Jobs must be idempotent
- Duplicates are possible
- Most systems use this

Exactly-once:
- Fencing tokens + idempotency keys
- Higher complexity and latency
- Needed for payments and billing
- Requires distributed coordination
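The idempotency-key half of this trade-off can be sketched as a result cache keyed by a client-supplied token. This is an in-memory illustration; a production system would persist results in a durable table with a unique constraint on the key, so a retried duplicate replays the stored result instead of re-running the side effect:

```python
class IdempotentExecutor:
    """Deduplicate job executions by idempotency key (in-memory sketch)."""
    def __init__(self):
        self._results = {}  # idempotency_key -> cached result

    def run(self, idempotency_key, action):
        if idempotency_key in self._results:
            # Duplicate delivery: replay the recorded result, skip the side effect.
            return self._results[idempotency_key]
        result = action()
        self._results[idempotency_key] = result
        return result
```

This is why a retried payment with the same key charges the customer once, even if the job is delivered twice.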
Queue Backend: Redis vs Kafka
Redis (sorted sets for priority, pub/sub for notifications): low latency (~1ms), simple, but limited durability. Best for <100K jobs/sec. Kafka (partitioned topics, consumer groups): high throughput, durable, exactly-once via transactions. Best for >100K jobs/sec with strict ordering guarantees. Hybrid approach: Redis for priority queue + Kafka for event log.
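The sorted-set delay-queue pattern mentioned above (ZADD with score = run-at timestamp, then a range query up to "now" to fetch due jobs) can be modeled in memory. `MiniZSet` is a stand-in written for illustration, not the Redis client API:

```python
import bisect

class MiniZSet:
    """In-memory stand-in for a Redis sorted set, modeling the delay-queue pattern."""
    def __init__(self):
        self._entries = []  # kept sorted by (score, member)

    def zadd(self, score, member):
        bisect.insort(self._entries, (score, member))

    def zrangebyscore(self, min_score, max_score):
        """Members whose score falls in [min_score, max_score], in score order."""
        return [m for s, m in self._entries if min_score <= s <= max_score]

    def zrem(self, member):
        self._entries = [(s, m) for s, m in self._entries if m != member]
```

Against real Redis the poll loop would be `ZRANGEBYSCORE queue 0 <now>` followed by `ZREM` for each job it claims (or a Lua script to make claim-and-remove atomic across competing pollers).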
Cron Scheduling at Scale
Store cron schedules in DB. A scheduler service polls every second for due jobs. Use leader election (via ZooKeeper/etcd) to ensure only one scheduler instance fires each cron job. Pre-compute next N execution times and insert into a delay queue sorted by execution time. Partition cron jobs by hash to distribute load across scheduler instances.
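The hash-partitioning step can be sketched in a few lines. Using a stable cryptographic hash (rather than Python's built-in `hash`, which varies between processes) guarantees every scheduler instance computes the same owner for a given job:

```python
import hashlib

def scheduler_for(job_id, num_schedulers):
    """Stable hash partitioning: each cron job is owned by exactly one
    scheduler instance, so no two instances fire the same job."""
    digest = hashlib.sha256(job_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_schedulers
```

Each instance then polls only for due jobs where `scheduler_for(job_id, N) == my_index`, with leader election covering failover when an instance dies.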
Capacity Estimation
Estimate the infrastructure needed for your job scheduling system.
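A back-of-envelope sizing calculation might look like the following. The workload figures, peak factor, and headroom target are illustrative assumptions, not values from this article:

```python
def capacity_estimate(jobs_per_day, avg_duration_s, peak_factor=3, headroom=0.7):
    """Rough worker-count sizing.

    peak_factor: how concentrated traffic is versus the daily average (assumed 3x).
    headroom: target utilization ceiling so retries and spikes do not overload
    the pool (assumed 70%).
    """
    avg_rate = jobs_per_day / 86_400           # jobs/sec averaged over the day
    peak_rate = avg_rate * peak_factor
    workers = peak_rate * avg_duration_s / headroom
    return {"avg_rate": avg_rate, "peak_rate": peak_rate,
            "workers_needed": int(workers) + 1}
```

For example, 10 million jobs/day at 2 s average duration works out to about 116 jobs/sec on average, roughly 347 jobs/sec at an assumed 3x peak, and just under a thousand workers at 70% target utilization.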