Design A Task Scheduler - System Design

Requirements

Functional Requirements:

User Interface for Scheduling:Implement a user-friendly API for users to schedule tasks, view current task status, and modify tasks as needed.
Task can be run One-time or recurring
system must trigger tasks very close to scheduled time (low jitter)
must be able to handle thousands or many of scheduled tasks concurrently
mechanism to persist tasks , so they can bsurvive restarts, also can be rebalanced across servers.
Tasks might also include talking to external services. so, retries and failure handling is required
Clock Synchronisation across servers.
Also risk of reexecution of same task due to failure or retries. So, will need idempotency or deduplication
Recurrring task shouldn't drift too much over time .
Recurring scheduling : fixed interval and full cron expression with timezone and calendars.
API should be multi‑tenant and secure (auth, rate limiting, per‑tenant quotas)
At the moment single region based scheduling is enough but if geo- distributed scheduling might be added in future. so, scalable!
Notification and Logging: Include features to notify users via email or webhook upon task completion, failure, or retries. Keep logs of task executions for auditing purposes.
Scheduling Conflicts Handling
Time Zone Support: Provide robust support for time zone specifications when scheduling tasks, especially for recurring tasks that might occur across different regions.
Task Dependency Management: Allow users to define dependencies between tasks to ensure that certain tasks only run after the completion of others.

Non-Functional Requirements:

High Availability
Reliability
Faiult Tolerant
Scalable
Secure

API Design

Create one‑time task
- POST /tasks
- Request:
  - target_type (e.g., HTTP_CALLBACK, QUEUE_MESSAGE)
  - target_config (e.g., URL, headers/body, queue name, payload)
  - run_at (ISO timestamp, UTC)
  - idempotency_key (optional string)
- Response:
  - task_id, status, run_at
Create recurring task
- POST /recurring-tasks
- Request:
  - name
  - target_type
  - target_config
  - start_time (UTC)
  - interval_seconds (simple fixed interval as per transcript, not full cron)
  - max_runs (optional)
  - idempotency_key (optional)
- Response:
  - recurring_task_id, next_run_at, status
Get task status
- GET /tasks/{task_id}
- Response:
  - task_id, status (PENDING, RUNNING, SUCCESS, FAILED, CANCELLED)
  - run_at, started_at, completed_at, last_error, attempts
Cancel task
- POST /tasks/{task_id}/cancel
- Response:
  - task_id, status
Pause / resume recurring task
- POST /recurring-tasks/{id}/pause
- POST /recurring-tasks/{id}/resume
List tasks (for debugging/ops)
- GET /tasks?status=PENDING&limit=100&cursor=...
Task dependency management
- POST /tasks/{task_id}/dependencies
  - Add / update dependencies for an existing task (if we want it separate from creation).
- GET /tasks/{task_id}/dependencies
  - Returns the DAG/graph for debugging.
Notification settings and webhooks
- POST /notification-settings
  - Fields: user_id, on_success (bool), on_failure (bool), delivery_type (EMAIL, SLACK_WEBHOOK, HTTP_WEBHOOK), delivery_config (json).
- Scheduler triggers notifications when a task transitions to terminal states (SUCCESS/FAILED) based on this configuration.
Monitoring endpoints
- GET /health
- GET /metrics (Prometheus format), exposing:
  - Tasks scheduled per second, tasks executed per second, success/failure rates, execution latency (scheduled vs actual), scheduler lag, queue depth, etc.
Admin / ops utilities
- POST /tasks/{task_id}/retry
- POST /tasks/{task_id}/force-complete
- GET /scheduler/status (e.g., shard ownership, last DB scan time).

Possible future endpoints

GET /recurring-tasks/{id} and /recurring-tasks for management dashboards.
POST /recurring-tasks/{id}/trigger-now to force a run.
Bulk creation or bulk cancellation endpoints if throughput becomes an issue.

We need two different storage patterns:

Relational DB (e.g., Postgres or MySQL)
- Good for strong consistency, complex queries, joins, and transactions.
- Used for task metadata, recurring definitions, and history.
In-memory / cache / priority structure
- Example: Redis, or internal in-memory structure per scheduler shard.
- Used for quick access to the “next” tasks to run (sorted by run_at).

We’ll use Postgres as the main durable store, and Redis for a near-future task queue / scheduling wheel.

Table examples

Table: tasks (one-time or instances of recurring tasks)

id (UUID, PK)
recurring_task_id (nullable, FK to recurring_tasks)
status (enum: PENDING, RUNNING, SUCCESS, FAILED, CANCELLED)
run_at (timestamp with time zone, indexed)
created_at (timestamp)
updated_at (timestamp)
target_type (varchar)
target_config (jsonb) – HTTP URL, headers, payload, etc.
attempts (int)
max_attempts (int)
last_error (text, nullable)
idempotency_key (varchar, nullable, indexed)
dedupe_key (varchar, nullable) – for exactly-once semantics if needed
priority (int, default 0)
time_zone (varchar, optional, if we decide to support it later, but for now we store UTC and keep tz on the client side)
shard_key (int or varchar) – to assign tasks to a scheduler shard deterministically.

Useful indexes

Index on (status, run_at) to find due tasks efficiently: status = 'PENDING' AND run_at <= now().
Index on recurring_task_id to query all instances of a recurring schedule.
Unique index on (idempotency_key) to prevent duplicate creation.
btree index on (status, run_at, priority) if we introduce priority scheduling.

Table: task_dependencies (for DAG)

task_id (UUID, FK to tasks)
depends_on_task_id (UUID, FK to tasks)
Composite PK: (task_id, depends_on_task_id)
Optional dependency_type (e.g., MUST_SUCCEED, MUST_FINISH).

To find tasks that are ready to run we’ll either:

Maintain a ready flag / unresolved_dependencies_count column on tasks, or
Compute readiness via queries over task_dependencies (but that’s slower).

Table: recurring_tasks

id (UUID, PK)
name (varchar)
status (enum: ACTIVE, PAUSED, CANCELLED)
start_time (timestamp with time zone)
interval_seconds (int) – simple fixed interval as per transcript hint.
next_run_at (timestamp with time zone, indexed)
max_runs (int, nullable)
runs_count (int)
target_type (varchar)
target_config (jsonb)
created_at (timestamp)
updated_at (timestamp)

Table: task_execution_logs (optional for history/observability)

id (bigserial, PK)
task_id (UUID, FK to tasks)
status (enum: SUCCESS, FAILED)
attempt_number (int)
started_at (timestamp)
completed_at (timestamp)
error_message (text, nullable)

Redis keys

Sorted set: due_tasks with score = run_at_epoch_ms, member = task_id for tasks due in the next N minutes.
This allows O(log N) insertion and O(log N) fetch of earliest tasks.

High-Level Design

Client
- Any service that needs to schedule work.
- Calls the Scheduler API.
API Gateway / Load Balancer
- Fronts the scheduler service for HA and routing.
- Performs auth/rate limiting if needed.
Scheduler API Service
- Stateless HTTP service.
- Validates and accepts schedule requests.
- Writes tasks and recurring tasks to Postgres.
- For near-future tasks, also pushes their IDs into Redis due_tasks sorted set.
- Exposes query and management endpoints (get status, cancel, etc.).
- Implements tasks/mutate, recurring catch-up config, and backoff configuration.
- On create:
  - Validates payload (protects against user errors like invalid target).
  - Upserts by idempotency_key to avoid duplicates.
- On reschedule:
  - Updates run_at, next_attempt_at, and reinsert into Redis/priority queue.
Task Scheduler / Orchestrator
- Background service (could be part of Scheduler API or separate).
- Periodically scans Postgres for tasks with status = PENDING and run_at in the near future (e.g., next 5 minutes) and loads them into Redis sorted set.
- For recurring tasks:
  - Scans recurring_tasks where status = ACTIVE and next_run_at <= now().
  - Creates corresponding concrete tasks rows and updates next_run_at = next_run_at + interval_seconds.
  - Detects downtime by comparing now() with last_run_at.
  - According to catch_up_mode:
    - RUN_ALL_MISSED: compute number of missed intervals; create that many tasks (with run_at spaced at the interval).
    - RUN_ONE_NOW: create one task with run_at = now(), then set next_run_at = now() + interval.
    - SKIP_MISSED: set next_run_at to the next future aligned time; maybe log skipped count.
- Updates last_run_at and next_run_at atomically in a transaction.
- Ensures tasks are sharded/partitioned across instances to avoid duplication (e.g., using consistent hashing on task_id).
Task Dispatcher
- Background consumer that:
  - Pulls tasks ordered by:
    - priority first, then run_at / next_attempt_at.
  - It might use:
    - A priority queue in memory or
    - Redis sorted sets keyed by priority and due_time.
  - Only tasks where:
    - status = PENDING, unresolved_dependencies_count = 0, and run_at or next_attempt_at <= now() are eligible.
  - Polls Redis sorted set for tasks where run_at <= now() + small_delta.
  - Attempts to atomically claim each task (mark as RUNNING in DB with SELECT ... FOR UPDATE or status = PENDING AND ... update with rows affected = 1).
  - On successful claim, sends a message to a queue or directly dispatches to workers.
- If we want jitter < 100ms for due tasks:
  - Dispatchers can:
  - Pop tasks from Redis (or in-memory min-heap) up to, say, 1–2 seconds ahead of time.
  - Each worker holds tasks in a small time wheel / min-heap and sleeps until run_at with high-resolution timers.
- That means per dispatcher maybe a few thousand tasks in RAM at any given time, which is tiny.

Execution Workers
- Fleet of stateless workers that:
  - Consume tasks from a message queue (e.g., Kafka/RabbitMQ), or accept RPC from dispatcher.
  - Execute the target: HTTP call, enqueue message, etc.
  - Update the task status in DB (SUCCESS or FAILED).
  - Perform retries with exponential backoff up to max_attempts.
  - Execute tasks.
  - On failure:
    - If attempts + 1 >= max_attempts → mark FAILED terminally.
    - Else:
      - Computedelay:
        FIXED: backoff_base_delay_ms
        EXPONENTIAL: min(backoff_base_delay_ms * 2^(attempts), backoff_max_delay_ms)
    - Set next_attempt_at = now() + delay.
    - Set status back to PENDING.
Dependency Resolver
- Could be a small background job or logic in the worker:
  - When a task reaches terminal state (SUCCESS or FAILED):
    - Update dependents’ unresolved_dependencies_count.
    - Optionally propagate failure: if a dependency fails and dependency_type = MUST_SUCCEED, mark dependents as FAILED or SKIPPED.
- This is where you’d mention DAG processing and avoiding cycles via validation at creation time.
Message Queue (optional but recommended)
- E.g., Kafka or RabbitMQ.
- Decouples dispatch from execution to smooth spikes and enable parallelism and backpressure.
Postgres DB Cluster
- Primary + replicas.
- Primary handles writes; replicas for read-heavy dashboards.
Redis Cluster
- Stores near-future tasks in a sorted set.
- Enables quick retrieval of tasks due soon across scheduler/dispatcher instances.
Monitoring / Logging
- Hooks into workers and dispatcher to emit metrics:
  - Success/failure counts
  - Retry counts
  - Scheduler lag
  - Skipped recurring runs
- Feeds Prometheus/Grafana or similar
Fault-Tolerance Extras
- Write-ahead log (optional) for tasks being dispatched:
  - Before a dispatcher marks a task as RUNNING, append an event to a WAL (could be Kafka).
  - On crash recovery, if there are tasks in WAL marked as RUNNING with no final status within a timeout, we put them back to PENDING.
- Or simpler:
  - Use a heartbeat/locked_by column and consider tasks stuck in RUNNING beyond a threshold as re‑queue candidates.

Interactions (rough flow)

Client → API Gateway → Scheduler API → Postgres + Redis (for near-future tasks).
Task Scheduler → Postgres (scan recurring_tasks and tasks) → Redis (load due tasks).
Task Dispatcher → Redis (pop due tasks) → Postgres (claim and mark RUNNING) → Message Queue.
Execution Worker → Message Queue → external target (HTTP, queue) → Postgres (update result).

Detailed Component Design

Key challenges

Timing accuracy vs. scalability
- If we used only DB polling (SELECT ... WHERE run_at <= now()), we’d hit the DB a lot and risk delays.
- Using Redis as a “near-future” bucket lets us:
  - Keep all tasks due in the next X minutes sorted by time.
  - Have many dispatcher instances reading quickly without heavy DB scanning.
At-least-once vs. exactly-once
- Distributed systems make “exactly-once” difficult.
- We design for at-least-once execution with idempotent tasks:
  - Task claims are done by atomically updating the DB row: UPDATE tasks SET status='RUNNING' WHERE id=? AND status='PENDING'.
  - Only the instance that successfully updates from PENDING to RUNNING proceeds.
  - If a worker crashes mid-execution, a watchdog can set long-running RUNNING tasks back to PENDING or mark FAILED and reschedule, which might cause re-execution.
- Clients should use idempotency keys and idempotent targets (e.g., operations keyed by an external id).
Recurring scheduling logic
- Based on the transcript, recurring schedules are simple fixed intervals:
  - We only store interval_seconds, no complicated cron, no time zones, no calendar holidays.
- This simplifies the scheduler:
  - For each recurring_task with next_run_at <= now():
    - Start a DB transaction.
    - Insert a new row in tasks with run_at = next_run_at.
    - Update next_run_at = next_run_at + interval_seconds.
    - Increment runs_count; if runs_count >= max_runs, set status = CANCELLED.
  - This avoids drift (we always base on next_run_at, not now()).
Sharding / partitioning
- To avoid two scheduler instances generating duplicate tasks, we can:
  - Use a shard key like hash(recurring_task_id) % N and assign shards to instances.
  - Or use DB-level advisory locks or partitioning.
- Similarly, for tasks, while Redis holds the multiple due tasks, each dispatcher still performs an atomic DB claim, so duplicates from Redis don’t lead to double execution.
Failure modes and retries
- DB down: Scheduler and API will fail to schedule tasks; clients should receive clear errors; we can add local queues or write-ahead logs if needed.
- Redis down: Fallback to DB polling on tasks table at lower frequency; performance degrades but correctness remains.
- Queue backlog: Implement backpressure; dispatcher rate-limits how many tasks it pushes based on queue depth.
Trade-offs
- DB-only vs. DB+Redis+Queue
  - DB-only: simpler architecture, but limited scalability and higher latency.
  - DB+Redis+Queue: more moving parts, but better handling of large volumes and spikes.
  - Given “thousands of tasks” and desire for minimal delay, I’d recommend DB+Redis+Queue, while keeping the APIs simple.
- Simple interval vs. cron
  - Transcript suggests: “it shouldn't be full cron expression with time zones and calendars.”
  - This reduces complexity significantly:
    - no timezone rules
    - no daylight savings anomalies
  - Trade-off is less expressivity, but this seems acceptable for the practice system.

Key Points:

Capacity: thousands of tasks, peaks around 100 exec/sec → we choose Redis + message queue + horizontal worker scaling.
Reliability: tasks must persist and survive restarts → we use Postgres as durable store with strong consistency.
Simplicity of schedules: no full cron/time zones → design schedules as fixed interval_seconds with UTC timestamps.
Low latency execution: tasks should fire near their time → Redis sorted set holding near-future tasks and dispatchers polling frequently.
Resilience to failure: at-least-once semantics, idempotency → DB-based task claiming and retries.
Observability: need to debug timing issues and failures → explicit task_execution_logs and metrics.
Cost: avoid over-engineering → single-region initial deployment with modest cluster sizes; add sharding only when traffic grows.

1. Backoff strategy for retries

Instead of retrying immediately on failure, we compute a next_attempt_at based on backoff.
Exponential backoff with jitter is common:
- delay = random_between(0.5, 1.5) * min(base * 2^attempts, max_delay)
- Jitter prevents large herds of tasks from retrying at exactly the same time.
Trade-offs:
- Pros: protects your downstream services and the scheduler during outages.
- Cons: increases latency for tasks that might succeed on immediate retry.
In an interview, you can say: “I’d default to exponential backoff with jitter, configurable per task.”

2. Push-based vs pull-based scheduling

Pull-based (what we mostly described):
- Workers pull tasks from a queue when they’re ready.
- Easy to scale horizontally and apply backpressure.
Push-based:
- Scheduler directly calls workers when tasks are due.
- Lower latency but harder to scale and manage backpressure.
Hybrid:
- Scheduler pushes tasks into a message queue (push from scheduler, pull by workers).
- That’s effectively what we’re doing: a push into Kafka/RabbitMQ and pull by workers.
You can explicitly mention that this hybrid approach provides low jitter but still handles variable load.

3. Duplicate execution during failures

Potential duplicate execution happens when:
- Worker pulls task, marks RUNNING, does the work, then crashes before writing SUCCESS.
Resolution protocol:
- Each task has a dedupe_key (maybe same as idempotency_key).
- Worker writes a completion record that includes dedupe_key and attempt_number.
- The DB enforces uniqueness on (task_id, attempt_number) or (dedupe_key, final_state).
- Another worker that picks up the same task later will see that a SUCCESS completion exists and can skip executing again (or treat as no-op).
Often, simpler is to accept at-least-once and rely on idempotent targets, but if they push you, mention this pattern.

4. Task dependencies and DAG

We treat the dependency graph as a DAG:
- At creation time, we validate that adding new edges does not create cycles (e.g., DFS or union-find).
Execution:
- Only tasks with unresolved_dependencies_count = 0 are eligible.
- On any predecessor finishing:
  - Decrement count.
  - If count hits zero, mark PENDING → puts it into the “ready” pool.
- If a dependency fails and the dependency type is MUST_SUCCEED:
  - Mark dependent tasks as SKIPPED or FAILED immediately.
Edge cases:
- Misfires or delayed predecessor:
  - The dependent simply waits; unresolved_dependencies_count remains > 0, so it’s never considered due even if its run_at passes.
  - You could expose APIs to override (e.g., “force run this dependent even though a dependency failed”).