Requirements
Functional Requirements:
- User Interface for Scheduling:Implement a user-friendly API for users to schedule tasks, view current task status, and modify tasks as needed.
- Task can be run One-time or recurring
- system must trigger tasks very close to scheduled time (low jitter)
- must be able to handle thousands or many of scheduled tasks concurrently
- mechanism to persist tasks , so they can bsurvive restarts, also can be rebalanced across servers.
- Tasks might also include talking to external services. so, retries and failure handling is required
- Clock Synchronisation across servers.
- Also risk of reexecution of same task due to failure or retries. So, will need idempotency or deduplication
- Recurrring task shouldn't drift too much over time .
- Recurring scheduling : fixed interval and full cron expression with timezone and calendars.
- API should be multi‑tenant and secure (auth, rate limiting, per‑tenant quotas)
- At the moment single region based scheduling is enough but if geo- distributed scheduling might be added in future. so, scalable!
- Notification and Logging: Include features to notify users via email or webhook upon task completion, failure, or retries. Keep logs of task executions for auditing purposes.
- Scheduling Conflicts Handling
- Time Zone Support: Provide robust support for time zone specifications when scheduling tasks, especially for recurring tasks that might occur across different regions.
- Task Dependency Management: Allow users to define dependencies between tasks to ensure that certain tasks only run after the completion of others.
Non-Functional Requirements:
- High Availability
- Reliability
- Faiult Tolerant
- Scalable
- Secure
API Design
- Create one‑time task
POST /tasks- Request:
target_type(e.g.,HTTP_CALLBACK,QUEUE_MESSAGE)target_config(e.g., URL, headers/body, queue name, payload)run_at(ISO timestamp, UTC)idempotency_key(optional string)
- Response:
task_id,status,run_at
- Create recurring task
POST /recurring-tasks- Request:
nametarget_typetarget_configstart_time(UTC)interval_seconds(simple fixed interval as per transcript, not full cron)max_runs(optional)idempotency_key(optional)
- Response:
recurring_task_id,next_run_at,status
- Get task status
GET /tasks/{task_id}- Response:
task_id,status(PENDING, RUNNING, SUCCESS, FAILED, CANCELLED)run_at,started_at,completed_at,last_error,attempts
- Cancel task
POST /tasks/{task_id}/cancel- Response:
task_id,status
- Pause / resume recurring task
POST /recurring-tasks/{id}/pausePOST /recurring-tasks/{id}/resume
- List tasks (for debugging/ops)
GET /tasks?status=PENDING&limit=100&cursor=...
- Task dependency management
POST /tasks/{task_id}/dependencies- Add / update dependencies for an existing task (if we want it separate from creation).
GET /tasks/{task_id}/dependencies- Returns the DAG/graph for debugging.
- Notification settings and webhooks
POST /notification-settings- Fields:
user_id,on_success(bool),on_failure(bool),delivery_type(EMAIL, SLACK_WEBHOOK, HTTP_WEBHOOK),delivery_config(json).
- Fields:
- Scheduler triggers notifications when a task transitions to terminal states (SUCCESS/FAILED) based on this configuration.
- Monitoring endpoints
GET /healthGET /metrics(Prometheus format), exposing:- Tasks scheduled per second, tasks executed per second, success/failure rates, execution latency (scheduled vs actual), scheduler lag, queue depth, etc.
- Admin / ops utilities
POST /tasks/{task_id}/retryPOST /tasks/{task_id}/force-completeGET /scheduler/status(e.g., shard ownership, last DB scan time).
Possible future endpoints
GET /recurring-tasks/{id}and/recurring-tasksfor management dashboards.POST /recurring-tasks/{id}/trigger-nowto force a run.- Bulk creation or bulk cancellation endpoints if throughput becomes an issue.
We need two different storage patterns:
- Relational DB (e.g., Postgres or MySQL)
- Good for strong consistency, complex queries, joins, and transactions.
- Used for task metadata, recurring definitions, and history.
- In-memory / cache / priority structure
- Example: Redis, or internal in-memory structure per scheduler shard.
- Used for quick access to the “next” tasks to run (sorted by
run_at).
We’ll use Postgres as the main durable store, and Redis for a near-future task queue / scheduling wheel.
Table examples
Table: tasks (one-time or instances of recurring tasks)
id(UUID, PK)recurring_task_id(nullable, FK torecurring_tasks)status(enum: PENDING, RUNNING, SUCCESS, FAILED, CANCELLED)run_at(timestamp with time zone, indexed)created_at(timestamp)updated_at(timestamp)target_type(varchar)target_config(jsonb) – HTTP URL, headers, payload, etc.attempts(int)max_attempts(int)last_error(text, nullable)idempotency_key(varchar, nullable, indexed)dedupe_key(varchar, nullable) – for exactly-once semantics if neededpriority(int, default 0)time_zone(varchar, optional, if we decide to support it later, but for now we store UTC and keep tz on the client side)shard_key(int or varchar) – to assign tasks to a scheduler shard deterministically.
Useful indexes
- Index on
(status, run_at)to find due tasks efficiently:status = 'PENDING' AND run_at <= now(). - Index on
recurring_task_idto query all instances of a recurring schedule. - Unique index on
(idempotency_key)to prevent duplicate creation. btreeindex on(status, run_at, priority)if we introduce priority scheduling.
Table: task_dependencies (for DAG)
task_id(UUID, FK totasks)depends_on_task_id(UUID, FK totasks)- Composite PK:
(task_id, depends_on_task_id) - Optional
dependency_type(e.g., MUST_SUCCEED, MUST_FINISH).
To find tasks that are ready to run we’ll either:
- Maintain a
readyflag /unresolved_dependencies_countcolumn ontasks, or - Compute readiness via queries over
task_dependencies(but that’s slower).
Table: recurring_tasks
id(UUID, PK)name(varchar)status(enum: ACTIVE, PAUSED, CANCELLED)start_time(timestamp with time zone)interval_seconds(int) – simple fixed interval as per transcript hint.next_run_at(timestamp with time zone, indexed)max_runs(int, nullable)runs_count(int)target_type(varchar)target_config(jsonb)created_at(timestamp)updated_at(timestamp)
Table: task_execution_logs (optional for history/observability)
id(bigserial, PK)task_id(UUID, FK totasks)status(enum: SUCCESS, FAILED)attempt_number(int)started_at(timestamp)completed_at(timestamp)error_message(text, nullable)
Redis keys
- Sorted set:
due_taskswithscore = run_at_epoch_ms,member = task_idfor tasks due in the next N minutes. - This allows O(log N) insertion and O(log N) fetch of earliest tasks.
High-Level Design
- Client
- Any service that needs to schedule work.
- Calls the Scheduler API.
- API Gateway / Load Balancer
- Fronts the scheduler service for HA and routing.
- Performs auth/rate limiting if needed.
- Scheduler API Service
- Stateless HTTP service.
- Validates and accepts schedule requests.
- Writes tasks and recurring tasks to Postgres.
- For near-future tasks, also pushes their IDs into Redis
due_taskssorted set. - Exposes query and management endpoints (get status, cancel, etc.).
- Implements
tasks/mutate, recurringcatch-upconfig, and backoff configuration. - On
create:- Validates payload (protects against user errors like invalid target).
- Upserts by
idempotency_keyto avoid duplicates.
- On
reschedule:- Updates
run_at,next_attempt_at, and reinsert into Redis/priority queue.
- Updates
- Task Scheduler / Orchestrator
- Background service (could be part of Scheduler API or separate).
- Periodically scans Postgres for tasks with
status = PENDINGandrun_atin the near future (e.g., next 5 minutes) and loads them into Redis sorted set. - For recurring tasks:
- Scans
recurring_taskswherestatus = ACTIVEandnext_run_at <= now(). - Creates corresponding concrete
tasksrows and updatesnext_run_at = next_run_at + interval_seconds. - Detects downtime by comparing
now()withlast_run_at. - According to
catch_up_mode:RUN_ALL_MISSED: compute number of missed intervals; create that manytasks(withrun_atspaced at the interval).RUN_ONE_NOW: create onetaskwithrun_at = now(), then setnext_run_at = now() + interval.SKIP_MISSED: setnext_run_atto the next future aligned time; maybe log skipped count.
- Scans
- Updates
last_run_atandnext_run_atatomically in a transaction. - Ensures tasks are sharded/partitioned across instances to avoid duplication (e.g., using consistent hashing on task_id).
- Task Dispatcher
- Background consumer that:
- Pulls tasks ordered by:
priorityfirst, thenrun_at/next_attempt_at.
- It might use:
- A priority queue in memory or
- Redis sorted sets keyed by
priorityanddue_time.
- Only tasks where:
status = PENDING,unresolved_dependencies_count = 0, andrun_atornext_attempt_at <= now()are eligible.
- Polls Redis sorted set for tasks where
run_at <= now() + small_delta. - Attempts to atomically claim each task (mark as RUNNING in DB with
SELECT ... FOR UPDATEorstatus = PENDING AND ...update with rows affected = 1). - On successful claim, sends a message to a queue or directly dispatches to workers.
- Pulls tasks ordered by:
- If we want jitter < 100ms for due tasks:
- Dispatchers can:
- Pop tasks from Redis (or in-memory min-heap) up to, say, 1–2 seconds ahead of time.
- Each worker holds tasks in a small time wheel / min-heap and sleeps until
run_atwith high-resolution timers.
- That means per dispatcher maybe a few thousand tasks in RAM at any given time, which is tiny.
- Background consumer that:
- Execution Workers
- Fleet of stateless workers that:
- Consume tasks from a message queue (e.g., Kafka/RabbitMQ), or accept RPC from dispatcher.
- Execute the target: HTTP call, enqueue message, etc.
- Update the task status in DB (SUCCESS or FAILED).
- Perform retries with exponential backoff up to
max_attempts. - Execute tasks.
- On failure:
- If
attempts + 1 >= max_attempts→ mark FAILED terminally. - Else:
- Compute
delay:FIXED:backoff_base_delay_msEXPONENTIAL:min(backoff_base_delay_ms * 2^(attempts), backoff_max_delay_ms)
- Compute
- Set
next_attempt_at = now() + delay. - Set
statusback toPENDING.
- If
- Fleet of stateless workers that:
- Dependency Resolver
- Could be a small background job or logic in the worker:
- When a task reaches terminal state (SUCCESS or FAILED):
- Update dependents’
unresolved_dependencies_count. - Optionally propagate failure: if a dependency fails and dependency_type = MUST_SUCCEED, mark dependents as FAILED or SKIPPED.
- Update dependents’
- When a task reaches terminal state (SUCCESS or FAILED):
- This is where you’d mention DAG processing and avoiding cycles via validation at creation time.
- Could be a small background job or logic in the worker:
- Message Queue (optional but recommended)
- E.g., Kafka or RabbitMQ.
- Decouples dispatch from execution to smooth spikes and enable parallelism and backpressure.
- Postgres DB Cluster
- Primary + replicas.
- Primary handles writes; replicas for read-heavy dashboards.
- Redis Cluster
- Stores near-future tasks in a sorted set.
- Enables quick retrieval of tasks due soon across scheduler/dispatcher instances.
- Monitoring / Logging
- Hooks into workers and dispatcher to emit metrics:
- Success/failure counts
- Retry counts
- Scheduler lag
- Skipped recurring runs
- Feeds Prometheus/Grafana or similar
- Hooks into workers and dispatcher to emit metrics:
- Fault-Tolerance Extras
- Write-ahead log (optional) for tasks being dispatched:
- Before a dispatcher marks a task as RUNNING, append an event to a WAL (could be Kafka).
- On crash recovery, if there are tasks in WAL marked as RUNNING with no final status within a timeout, we put them back to PENDING.
- Or simpler:
- Use a
heartbeat/locked_bycolumn and consider tasks stuck in RUNNING beyond a threshold as re‑queue candidates.
- Use a
- Write-ahead log (optional) for tasks being dispatched:
Interactions (rough flow)
- Client → API Gateway → Scheduler API → Postgres + Redis (for near-future tasks).
- Task Scheduler → Postgres (scan recurring_tasks and tasks) → Redis (load due tasks).
- Task Dispatcher → Redis (pop due tasks) → Postgres (claim and mark RUNNING) → Message Queue.
- Execution Worker → Message Queue → external target (HTTP, queue) → Postgres (update result).
Detailed Component Design
Key challenges
- Timing accuracy vs. scalability
- If we used only DB polling (
SELECT ... WHERE run_at <= now()), we’d hit the DB a lot and risk delays. - Using Redis as a “near-future” bucket lets us:
- Keep all tasks due in the next X minutes sorted by time.
- Have many dispatcher instances reading quickly without heavy DB scanning.
- If we used only DB polling (
- At-least-once vs. exactly-once
- Distributed systems make “exactly-once” difficult.
- We design for at-least-once execution with idempotent tasks:
- Task claims are done by atomically updating the DB row:
UPDATE tasks SET status='RUNNING' WHERE id=? AND status='PENDING'. - Only the instance that successfully updates from PENDING to RUNNING proceeds.
- If a worker crashes mid-execution, a watchdog can set long-running RUNNING tasks back to PENDING or mark FAILED and reschedule, which might cause re-execution.
- Task claims are done by atomically updating the DB row:
- Clients should use idempotency keys and idempotent targets (e.g., operations keyed by an external id).
- Recurring scheduling logic
- Based on the transcript, recurring schedules are simple fixed intervals:
- We only store
interval_seconds, no complicated cron, no time zones, no calendar holidays.
- We only store
- This simplifies the scheduler:
- For each
recurring_taskwithnext_run_at <= now():- Start a DB transaction.
- Insert a new row in
taskswithrun_at = next_run_at. - Update
next_run_at = next_run_at + interval_seconds. - Increment
runs_count; ifruns_count >= max_runs, setstatus = CANCELLED.
- This avoids drift (we always base on
next_run_at, notnow()).
- For each
- Based on the transcript, recurring schedules are simple fixed intervals:
- Sharding / partitioning
- To avoid two scheduler instances generating duplicate tasks, we can:
- Use a shard key like
hash(recurring_task_id) % Nand assign shards to instances. - Or use DB-level advisory locks or partitioning.
- Use a shard key like
- Similarly, for
tasks, while Redis holds the multiple due tasks, each dispatcher still performs an atomic DB claim, so duplicates from Redis don’t lead to double execution.
- To avoid two scheduler instances generating duplicate tasks, we can:
- Failure modes and retries
- DB down: Scheduler and API will fail to schedule tasks; clients should receive clear errors; we can add local queues or write-ahead logs if needed.
- Redis down: Fallback to DB polling on
taskstable at lower frequency; performance degrades but correctness remains. - Queue backlog: Implement backpressure; dispatcher rate-limits how many tasks it pushes based on queue depth.
- Trade-offs
- DB-only vs. DB+Redis+Queue
- DB-only: simpler architecture, but limited scalability and higher latency.
- DB+Redis+Queue: more moving parts, but better handling of large volumes and spikes.
- Given “thousands of tasks” and desire for minimal delay, I’d recommend DB+Redis+Queue, while keeping the APIs simple.
- Simple interval vs. cron
- Transcript suggests: “it shouldn't be full cron expression with time zones and calendars.”
- This reduces complexity significantly:
- no timezone rules
- no daylight savings anomalies
- Trade-off is less expressivity, but this seems acceptable for the practice system.
- DB-only vs. DB+Redis+Queue
Key Points:
- Capacity: thousands of tasks, peaks around 100 exec/sec → we choose Redis + message queue + horizontal worker scaling.
- Reliability: tasks must persist and survive restarts → we use Postgres as durable store with strong consistency.
- Simplicity of schedules: no full cron/time zones → design schedules as fixed
interval_secondswith UTC timestamps. - Low latency execution: tasks should fire near their time → Redis sorted set holding near-future tasks and dispatchers polling frequently.
- Resilience to failure: at-least-once semantics, idempotency → DB-based task claiming and retries.
- Observability: need to debug timing issues and failures → explicit
task_execution_logsand metrics. - Cost: avoid over-engineering → single-region initial deployment with modest cluster sizes; add sharding only when traffic grows.
1. Backoff strategy for retries
- Instead of retrying immediately on failure, we compute a
next_attempt_atbased on backoff. - Exponential backoff with jitter is common:
delay = random_between(0.5, 1.5) * min(base * 2^attempts, max_delay)- Jitter prevents large herds of tasks from retrying at exactly the same time.
- Trade-offs:
- Pros: protects your downstream services and the scheduler during outages.
- Cons: increases latency for tasks that might succeed on immediate retry.
- In an interview, you can say: “I’d default to exponential backoff with jitter, configurable per task.”
2. Push-based vs pull-based scheduling
- Pull-based (what we mostly described):
- Workers pull tasks from a queue when they’re ready.
- Easy to scale horizontally and apply backpressure.
- Push-based:
- Scheduler directly calls workers when tasks are due.
- Lower latency but harder to scale and manage backpressure.
- Hybrid:
- Scheduler pushes tasks into a message queue (push from scheduler, pull by workers).
- That’s effectively what we’re doing: a push into Kafka/RabbitMQ and pull by workers.
- You can explicitly mention that this hybrid approach provides low jitter but still handles variable load.
3. Duplicate execution during failures
- Potential duplicate execution happens when:
- Worker pulls task, marks RUNNING, does the work, then crashes before writing SUCCESS.
- Resolution protocol:
- Each task has a
dedupe_key(maybe same asidempotency_key). - Worker writes a completion record that includes
dedupe_keyandattempt_number. - The DB enforces uniqueness on
(task_id, attempt_number)or(dedupe_key, final_state). - Another worker that picks up the same task later will see that a SUCCESS completion exists and can skip executing again (or treat as no-op).
- Each task has a
- Often, simpler is to accept at-least-once and rely on idempotent targets, but if they push you, mention this pattern.
4. Task dependencies and DAG
- We treat the dependency graph as a DAG:
- At creation time, we validate that adding new edges does not create cycles (e.g., DFS or union-find).
- Execution:
- Only tasks with
unresolved_dependencies_count = 0are eligible. - On any predecessor finishing:
- Decrement count.
- If count hits zero, mark PENDING → puts it into the “ready” pool.
- If a dependency fails and the dependency type is MUST_SUCCEED:
- Mark dependent tasks as SKIPPED or FAILED immediately.
- Only tasks with
- Edge cases:
- Misfires or delayed predecessor:
- The dependent simply waits;
unresolved_dependencies_countremains > 0, so it’s never considered due even if itsrun_atpasses. - You could expose APIs to override (e.g., “force run this dependent even though a dependency failed”).
- The dependent simply waits;
- Misfires or delayed predecessor: