Task Scheduler — Leader election, locks, retries, DLQ, Sidekiq vs Celery vs Temporal

Architecture

Capacity Estimation

Metric	Value	Notes
Tasks/s peak	~50 K	SaaS at scale
Median task duration	50 ms–5 s	I/O bound
Worker count	~2 K	50 K/s × 1 s avg / 25 conc/worker
Queue size in flight	~1 M	scheduled + delayed
Storage (history 30d)	~500 GB	2 KB/task × 200 M/day
Backlog absorb rate	~500 K/s	10× provisioned for spike

Leader Election

Recurring jobs (cron) and delayed-task promotion need exactly one runner cluster-wide; otherwise you fire the same job N times. Use a coordinator that supports leases:

etcd / ZooKeeper / Consul — create an ephemeral leader key with a 10–30 s lease; the holder renews every 5 s. On crash, the lease expires and the next candidate wins.
Postgres advisory lock — SELECT pg_try_advisory_lock(?). Simple, but the lock dies with the connection — no graceful drain on app shutdown.
Redis with Redlock — works in practice for lease durations > clock skew. See the distributed lock page for the Kleppmann critique.

What can the leader actually do? Promote due tasks, fire cron triggers, sweep stuck workers. The leader does not execute tasks itself — that is the worker pool's job. Keep the leader thin.

Distributed Locks (Redlock)

Per-task locks prevent two workers from running the same task. Pattern: SET task:42 worker_id NX EX 30. While the worker is alive, it renews every 10 s. On crash, the lock expires and a peer takes over.

The fencing-token correction (Kleppmann): each lock acquisition increments a monotonic counter; the worker passes the token to any external system it writes to; the external system rejects writes with old tokens. This prevents a stale GC-paused worker from corrupting state on resume.

Exactly-once via Dedup

Networks deliver retries; workers crash mid-task. Exactly-once at the worker is impossible without cooperation from the side-effect target:

Idempotent task — the easiest. The task itself uses an idempotency key when calling external systems. Retries are safe by construction.
Dedup ledger — before running, write (task_id, attempt) to a uniqueness store; if it exists with terminal status, skip. Atomicity with the side effect is the catch — either the task is idempotent or you must use a DB transaction that wraps both the side effect and the ledger update (only works if the side effect is in the same DB).
Outbox pattern — the task writes its intent to an outbox table within the same DB transaction as the business write; a separate process publishes from the outbox to downstream systems. Decouples task atomicity from external delivery.

Retry Policies and Backoff

Retries are mandatory; bad retry policy is its own outage. Use exponential backoff with full jitter:

Exponential: delay = base · 2^attempt. Doubles each time, bounded.
Full jitter (recommended): delay = random(0, base · 2^attempt). Prevents synchronized retry storms after an outage; the alternative ("equal jitter") is marginally better for fairness but full jitter is simpler and proven.
Max attempts: 5–10 typical. Beyond that, the failure is not transient; route to DLQ.
Per-error policy: 4xx no retry (the input is bad); 429/503 retry with longer backoff respecting Retry-After; 5xx generic retry; network timeout retry but cap aggressively.

Priority Queues

One queue is insufficient: a payment retry must not wait behind 100 K weekly-digest emails. Patterns:

Per-priority queues with workers consuming by class — dedicated workers per priority, or weighted round-robin (3× samples from high, 1× from low).
Score-based (Redis ZSET) — score = priority · 1e6 + run_at_epoch; workers BZPOPMIN. Single queue, ordered by score. Good for 100 K–1 M tasks; degrades on huge queues.
Bucketed — high / med / low queues; workers' concurrency budget allocated per bucket. Prevents head-of-line blocking with simpler ops.

Dead-Letter Queues

After max retries, the task moves to a DLQ for human inspection. The DLQ is not a graveyard — it is an alert channel. Best practices:

Preserve full context: original payload, exception trace, all attempts, last status.
Aggregate alerts: 10 tasks of class X failed in 5 min → one PagerDuty page, not 10.
Re-drive UI: ops can edit (fix bad input) and re-enqueue, or batch re-drive a class once a deploy fixes the root cause.
TTL: tasks older than 7 days are archived to S3; otherwise the DLQ becomes infinite.

Sidekiq vs Celery vs Temporal vs Airflow

Sidekiq (Ruby) — Redis-backed, stupid-simple, fast. Best for short web-app tasks. Sidekiq Pro adds reliability via reliable_fetch.
Celery (Python) — backed by RabbitMQ or Redis, broader feature set, heavier. Workhorse of Django stacks; quirky configuration surface.
Temporal — durable workflows: code reads as if it's synchronous, the engine persists every step. Best for multi-step business processes (saga, payment flow). Heavier ops than Sidekiq, but the right tool when steps span minutes to weeks.
Airflow — DAG-based scheduler for batch ETL. Good for nightly jobs and data pipelines; bad for low-latency tasks (minimum scheduling interval is seconds, not milliseconds).
AWS Step Functions / Cadence / Conductor — managed Temporal-likes; pick if you want zero ops at the cost of vendor lock-in.

Choose by task profile: short web async = Sidekiq/Celery. Long multi-step business workflow = Temporal. Daily batch ETL = Airflow.

Failure Modes

Worker poison-pill — a malformed task crashes any worker that touches it. Wrap execution in a try/catch with crash counter; after N crashes, force-DLQ the task.
Visibility timeout misconfigured — long task exceeds the lease, two workers run it simultaneously. Configure lease > p99 task duration; renew during execution.
Backlog runaway — producers outpace workers. Auto-scale workers by queue depth, but cap to protect downstream services.
Cron drift — a cron at 03:00 that took 2 hours yesterday and now overlaps today's. Use a per-job mutex or "skip if previous still running."

FAQ

Postgres or Redis as the queue?

Postgres with SELECT FOR UPDATE SKIP LOCKED is durable and transactional — great when tasks live with the data. Redis ZSET is faster (~5x) but loses tasks on crash without RDB+AOF tuning. Mid-scale teams pick Postgres for one less moving part.

Push or pull workers?

Pull (workers poll) is simpler and self-balancing; push needs a coordinator. Pull with long-poll or BLPOP gives near-push latency without the coordination cost.

How do you do delayed tasks at scale?

Score = run_at epoch in a sorted set; promoter periodically moves due tasks to the ready queue. With 10 M scheduled tasks, partition the sorted set by run_at hour to bound size.

Should I build my own scheduler?

Almost never. Sidekiq/Celery/Temporal cover 95% of needs. Build only if you have unusual durability or scale requirements you have already proven existing engines cannot meet.