Design A Task Scheduler - System Design

# designed with ChatGPT help as practice

Scope and direction

Problem statement

A system that:

lets users schedule (and manage) tasks through a UI
runs each task once or on a recurring basis, using cron-style schedules (no complex calendars)
handles short-running tasks that execute asynchronously

Clarifications

what are these tasks? where are they loaded from?
- user-defined jobs created via an API
- either a webhook call or an internal job reference
what scheduling do we need to support for the recurring basis?
- basic cron style, no complex calendars
how do we want the users to trigger and manage them?
- users create, update and cancel tasks
do we need to notify the users about success/failure?
- no, the users can check via UI
how long do these tasks run for? what resources do they need? should they be configurable?
- a few seconds
- fixed resources
I assume it's only asynchronous?
- yes

Core constraints

high scalability -> execute thousands of tasks concurrently
timely execution -> minimal delay
high reliability -> tasks must not be lost under failures (at-least-once run)

Requirements

Functional Requirements:

users create and manage tasks via UI
the tasks are either webhook calls or internal job references
tasks can be scheduled for a single ad hoc run or on a simple cron-style schedule
jobs run asynchronously via workers

Non-Functional Requirements:

high reliability: once scheduled, a job runs at least once
medium scale (thousands of tasks per minute)
low scheduling latency (best effort)

The hardest non-functional constraint is at-least-once execution under failures.

API Design

Caller: UI + internal services

A minimal API surface for now:

POST /tasks
{schedule, payload} -> {task_id}
PUT /tasks/{task_id}
{schedule, payload}
DELETE /tasks/{task_id}
GET /tasks/{task_id}
{status, last_run, next_run}
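
To make the request and response shapes above concrete, here is a minimal in-memory sketch of the four endpoints. The class names (TaskApi, TaskRecord), the use of UUIDs for task_id, and the initial SCHEDULED status are assumptions for illustration; a real service would sit behind an HTTP framework and the Task DB.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    schedule: str
    payload: dict
    status: str = "SCHEDULED"        # assumed initial status
    last_run: Optional[str] = None
    next_run: Optional[str] = None


class TaskApi:
    """In-memory stand-in for the four endpoints (hypothetical helper)."""

    def __init__(self):
        self._tasks = {}

    def create(self, schedule, payload):
        # POST /tasks -> {task_id}
        task_id = str(uuid.uuid4())
        self._tasks[task_id] = TaskRecord(schedule, payload)
        return {"task_id": task_id}

    def update(self, task_id, schedule, payload):
        # PUT /tasks/{task_id}
        task = self._tasks[task_id]
        task.schedule, task.payload = schedule, payload

    def delete(self, task_id):
        # DELETE /tasks/{task_id}
        self._tasks.pop(task_id, None)

    def get(self, task_id):
        # GET /tasks/{task_id} -> {status, last_run, next_run}
        task = self._tasks[task_id]
        return {
            "status": task.status,
            "last_run": task.last_run,
            "next_run": task.next_run,
        }
```

Usage: api = TaskApi(); task_id = api.create("*/5 * * * *", {"job": "cleanup"})["task_id"]; api.get(task_id) then returns the status fields for that task.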

High-Level Design

Components

Frontend
API Gateway
Scheduler Service (stateless)
Task DB
Dispatcher
Queue
Worker Pool

As shown in the high-level diagram, the UI and API clients interact with the Scheduler Service, which persists tasks in the Task DB; the Dispatcher then picks up due tasks and hands them to workers via the queue.

Data model

Task {
  task_id,
  schedule,
  payload,
  status,
  next_run_at,
  attempts,
  created_at
}
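
One possible way to store this model, sketched with SQLite from the Python standard library. The column types, the Unix-seconds timestamps, the JSON-encoded payload, and the index name are assumptions; the index on (status, next_run_at) is what lets the dispatcher find due tasks without scanning the whole table.

```python
import sqlite3

SCHEMA = """
CREATE TABLE tasks (
    task_id      TEXT PRIMARY KEY,
    schedule     TEXT NOT NULL,                      -- cron expression (format assumed)
    payload      TEXT NOT NULL,                      -- JSON: webhook call or job reference
    status       TEXT NOT NULL DEFAULT 'SCHEDULED',
    next_run_at  INTEGER,                            -- Unix seconds; NULL when nothing is due
    attempts     INTEGER NOT NULL DEFAULT 0,
    created_at   INTEGER NOT NULL
);
CREATE INDEX idx_tasks_due ON tasks (status, next_run_at);
"""


def open_db(path=":memory:"):
    """Open a database and create the tasks table plus its time index."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```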

Read/execution path

Dispatcher queries tasks where next_run_at <= now
Tasks are atomically claimed
Claimed tasks are pushed to the queue
Workers execute tasks asynchronously
Result updates task status and schedules next run (if recurring)
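
The steps above can be sketched as a single-process simulation. Everything here is a simplification: tasks live in a dict instead of the Task DB, the claim is a plain assignment rather than an atomic update, interval_s is a hypothetical stand-in for the cron schedule, and the DONE status is an assumed name.

```python
import queue


def dispatch_due(tasks, work_queue, now):
    """Claim every due task and push its id onto the queue."""
    claimed = []
    for task_id, task in tasks.items():
        if task["status"] == "SCHEDULED" and task["next_run_at"] <= now:
            task["status"] = "IN_FLIGHT"   # stand-in for the atomic claim
            work_queue.put(task_id)
            claimed.append(task_id)
    return claimed


def run_worker_once(tasks, work_queue, execute, now):
    """Take one task off the queue, run it, and record the outcome."""
    task_id = work_queue.get_nowait()
    task = tasks[task_id]
    execute(task["payload"])
    interval = task.get("interval_s")      # hypothetical: seconds between runs
    if interval:
        task["status"] = "SCHEDULED"       # recurring: schedule the next run
        task["next_run_at"] = now + interval
    else:
        task["status"] = "DONE"            # one-off: finished
    return task_id
```

Because a claimed task is no longer SCHEDULED, a second dispatch_due call at the same time does not enqueue it again.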

Write path - task creation

User creates task via UI/API
Scheduler validates schedule and stores task
Task becomes eligible for dispatch via time index
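
A minimal sketch of the validation step, assuming a deliberately small cron subset: five fields, each one of *, a number, or */step. Ranges and lists (1-5, 1,3,5) are not handled here; the function name validate_schedule is hypothetical.

```python
import re

# Allowed (min, max) for minute, hour, day-of-month, month, day-of-week.
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]
# Matches "*", "*/step", or a plain number.
FIELD_RE = re.compile(r"\*(?:/(\d+))?|(\d+)")


def validate_schedule(expr):
    """Return True if expr is a valid schedule in the assumed cron subset."""
    fields = expr.split()
    if len(fields) != 5:
        return False
    for field, (lo, hi) in zip(fields, FIELD_RANGES):
        match = FIELD_RE.fullmatch(field)
        if match is None:
            return False
        step, value = match.group(1), match.group(2)
        if step is not None and int(step) == 0:
            return False                   # "*/0" would never advance
        if value is not None and not lo <= int(value) <= hi:
            return False                   # number outside the field's range
    return True
```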

Failure scenario - worker failure

Task execution fails, or the worker crashes and its lease expires
Attempts incremented
Task rescheduled with backoff
At-least-once guarantee preserved
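
One way to compute the backoff, sketched as capped exponential delay. The base of 5 seconds, the cap of 300 seconds, and the function names are assumptions, not values from the design above.

```python
def backoff_delay(attempts, base_s=5, cap_s=300):
    """Delay before the next try: base_s doubled per failed attempt, capped."""
    return min(cap_s, base_s * 2 ** (attempts - 1))


def record_failure(task, now):
    """Increment attempts and make the task due again after the backoff."""
    task["attempts"] += 1
    task["status"] = "SCHEDULED"
    task["next_run_at"] = now + backoff_delay(task["attempts"])
    return task
```

With these defaults the delays are 5 s, 10 s, 20 s and so on, levelling off at 300 s. The task always returns to SCHEDULED, so it stays eligible for dispatch.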

Capacity estimate

Assume ~5k tasks/minute
~10M tasks stored → ~10–20GB DB storage (assuming ~1–2 KB per task)
Dispatcher scales horizontally by sharding on time buckets
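
The arithmetic behind these numbers, as a small sketch. The 1 to 2 KB per task and the 3-second run time used in the usage note are assumptions; the worker estimate is just rate times run time.

```python
def dispatch_rate_per_s(tasks_per_minute):
    """Average number of tasks that become due each second."""
    return tasks_per_minute / 60


def storage_gb(stored_tasks, bytes_per_task):
    """Raw row storage in decimal gigabytes, ignoring indexes and overhead."""
    return stored_tasks * bytes_per_task / 1e9


def busy_workers(rate_per_s, run_seconds):
    """Average number of tasks executing at once (rate times run time)."""
    return rate_per_s * run_seconds
```

For 5k tasks/minute the dispatch rate is about 83 tasks/second; 10M tasks at 1,000 to 2,000 bytes each is 10 to 20 GB; and if a task takes about 3 seconds, roughly 250 run concurrently on average.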

Trade-offs

At-least-once execution simplifies reliability but may cause duplicate executions
Polling-based dispatch is simpler than push-based timers but slightly less precise

Future improvements

User notifications (email/webhooks)
Per-tenant rate limits
Multi-region schedulers

Detailed Component Design

Deep Dive 1 — Task claiming & concurrency

Problem

Multiple dispatcher instances may pick the same due task, causing duplicate execution.

Solution

Dispatchers atomically claim tasks using a state transition (SCHEDULED → IN_FLIGHT) with a lease (locked_until).

Only successfully claimed tasks are enqueued for execution.
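
A minimal sketch of the claim as a single conditional UPDATE, using SQLite for illustration. The table layout, the 30-second lease, and the rule that an expired lease makes an IN_FLIGHT task claimable again are assumptions. The key point is that the WHERE clause and the state change happen in one statement, so only one dispatcher sees a row count of 1.

```python
import sqlite3


def make_db():
    """Create an in-memory tasks table with just the columns the claim needs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE tasks (
               task_id      TEXT PRIMARY KEY,
               status       TEXT NOT NULL,
               next_run_at  INTEGER NOT NULL,
               locked_until INTEGER)"""
    )
    return conn


def try_claim(conn, task_id, now, lease_s=30):
    """Return True only if this caller moved the task to IN_FLIGHT."""
    cur = conn.execute(
        """UPDATE tasks
              SET status = 'IN_FLIGHT', locked_until = ?
            WHERE task_id = ?
              AND next_run_at <= ?
              AND (status = 'SCHEDULED'
                   OR (status = 'IN_FLIGHT' AND locked_until <= ?))""",
        (now + lease_s, task_id, now, now),
    )
    conn.commit()
    # rowcount is 1 if the row matched and was updated, 0 if someone else holds it.
    return cur.rowcount == 1
```

A dispatcher enqueues the task only when try_claim returns True. If a worker dies, the lease runs out and a later claim succeeds, which keeps at-least-once execution at the cost of a possible repeat run.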

Trade-off

Slightly higher DB contention

Strong correctness guarantees (prevents double dispatch)

Deep Dive 2 — Recurring task scheduling

Problem

Recurring tasks must be rescheduled reliably without drifting or missing executions.

Solution

After execution, compute next_run_at based on the cron expression.

Persist the next execution time atomically with the execution result.
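
A sketch of computing next_run_at for a small cron subset (each field is *, a number, or */step). It steps forward minute by minute from the given time until all five fields match, so the result always lands on a cron boundary rather than drifting with execution time. Assumptions: all five fields must match (standard cron treats day-of-month and day-of-week differently when both are restricted), day-of-week 0 is Sunday, and the search gives up after one year.

```python
from datetime import datetime, timedelta

# Lowest legal value per field: minute, hour, day-of-month, month, day-of-week.
FIELD_MIN = (0, 0, 1, 1, 0)


def _matches(field, value, lo):
    """Check one cron field ("*", "*/step", or a number) against a value."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return (value - lo) % int(field[2:]) == 0
    return value == int(field)


def next_run_at(expr, after):
    """Return the first minute strictly after `after` that matches expr."""
    fields = expr.split()
    t = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    for _ in range(366 * 24 * 60):             # bounded search: one year
        cron_dow = (t.weekday() + 1) % 7       # Python: Monday=0; cron: Sunday=0
        values = (t.minute, t.hour, t.day, t.month, cron_dow)
        if all(_matches(f, v, lo) for f, v, lo in zip(fields, values, FIELD_MIN)):
            return t
        t += timedelta(minutes=1)
    raise ValueError("no matching time within one year")
```

Passing the previous scheduled time (or the current time) as `after` gives the same boundary either way, which is why a slow execution does not push later runs off schedule.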

Trade-off

More scheduler logic

Simple, predictable recurring behaviour with durability