# designed with ChatGPT help as practice

Scope and direction

Problem statement

A system that:

  • lets users schedule (and manage) tasks through a UI
  • supports tasks that execute once or repeatedly (recurring), using cron-style schedules with no complex calendars
  • runs short tasks asynchronously


Clarifications

  • what are these tasks? where are they loaded from?
    • user-defined jobs created via an API
    • either a webhook call or an internal job reference
  • what scheduling do we need to support for the recurring basis?
    • basic cron style, no complex calendars
  • how do we want users to trigger and manage them?
    • users create, update and cancel tasks
  • do we need to notify the users about success/failure?
    • no, the users can check via UI
  • how long do these tasks run for? what resources do they need? should they be configurable?
    • a few seconds
    • fixed resources
  • I assume it's only asynchronous?
    • yes



Core constraints

  • high scalability -> execute thousands of tasks concurrently
  • timely execution -> minimal delay
  • high reliability -> tasks must not be lost under failures (at-least-once run)



Requirements

Functional Requirements:

  • users create and manage tasks via UI
  • the tasks are either webhook calls or internal job references
  • tasks can be scheduled for a single ad hoc run or on a simple cron-style schedule
  • jobs run asynchronously via workers


Non-Functional Requirements:

  • high reliability: once scheduled, a job runs at least once
  • medium scale (thousands of tasks per minute)
  • low scheduling latency (best effort)

The hardest non-functional constraint is at-least-once execution under failures.


API Design

Caller: UI + internal services

A minimal API surface for now:

    POST   /tasks            {schedule, payload} -> {task_id}
    PUT    /tasks/{task_id}  {schedule, payload}
    DELETE /tasks/{task_id}
    GET    /tasks/{task_id}  -> {status, last_run, next_run}
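
To make the request and response shapes concrete, here is a minimal in-memory sketch of the four endpoints. The class name TaskApi, the initial SCHEDULED status, and the choice to hard-delete on DELETE are assumptions for illustration only; a real service would sit behind an HTTP framework and the Task DB.

```python
import uuid
from typing import Any, Dict, Optional


class TaskApi:
    """In-memory stand-in for the four endpoints (hypothetical, no HTTP layer)."""

    def __init__(self) -> None:
        self._tasks: Dict[str, Dict[str, Any]] = {}

    def create_task(self, schedule: str, payload: Dict[str, Any]) -> Dict[str, str]:
        # POST /tasks {schedule, payload} -> {task_id}
        task_id = str(uuid.uuid4())
        self._tasks[task_id] = {
            "schedule": schedule,
            "payload": payload,
            "status": "SCHEDULED",  # assumed initial status
            "last_run": None,
            "next_run": None,       # a real service would derive this from the schedule
        }
        return {"task_id": task_id}

    def update_task(self, task_id: str, schedule: str, payload: Dict[str, Any]) -> bool:
        # PUT /tasks/{task_id} {schedule, payload}; False would map to a 404
        task = self._tasks.get(task_id)
        if task is None:
            return False
        task["schedule"] = schedule
        task["payload"] = payload
        return True

    def delete_task(self, task_id: str) -> bool:
        # DELETE /tasks/{task_id}; assumed to remove the task outright
        return self._tasks.pop(task_id, None) is not None

    def get_task(self, task_id: str) -> Optional[Dict[str, Any]]:
        # GET /tasks/{task_id} -> {status, last_run, next_run}
        task = self._tasks.get(task_id)
        if task is None:
            return None
        return {key: task[key] for key in ("status", "last_run", "next_run")}
```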




High-Level Design

Components

  • Frontend
  • API Gateway
  • Scheduler Service (stateless)
  • Task DB
  • Dispatcher
  • Queue
  • Worker Pool

As shown in the high-level diagram, the UI and API clients interact with the Scheduler Service, which persists tasks in the Task DB; the Dispatcher then hands due tasks to the Worker Pool via the Queue.


Data model

Task { task_id, schedule, payload, status, next_run_at, attempts, created_at }
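
As a sketch of how this record might look in code, the dataclass below mirrors the fields listed above. The field types, the SUCCEEDED and FAILED status values, and the is_due helper are assumptions added for illustration; only SCHEDULED and IN_FLIGHT appear elsewhere in these notes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict, Optional


class TaskStatus(str, Enum):
    SCHEDULED = "SCHEDULED"
    IN_FLIGHT = "IN_FLIGHT"
    SUCCEEDED = "SUCCEEDED"  # assumed terminal state
    FAILED = "FAILED"        # assumed terminal state


@dataclass
class Task:
    task_id: str
    schedule: str                    # cron expression (encoding of one-off runs is left open)
    payload: Dict[str, Any]          # webhook details or internal job reference
    next_run_at: Optional[datetime]  # None once nothing further is scheduled
    status: TaskStatus = TaskStatus.SCHEDULED
    attempts: int = 0
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def is_due(self, now: datetime) -> bool:
        """True when the dispatcher should pick this task up."""
        return (
            self.status is TaskStatus.SCHEDULED
            and self.next_run_at is not None
            and self.next_run_at <= now
        )
```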


Read/execution path

  1. Dispatcher queries tasks where next_run_at <= now
  2. Tasks are atomically claimed
  3. Claimed tasks are pushed to the queue
  4. Workers execute tasks asynchronously
  5. Result updates task status and schedules next run (if recurring)
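
The five steps above can be sketched as a single-process simulation. Everything here is a simplification under stated assumptions: tasks are plain dicts, a fixed "interval" stands in for the cron schedule, the in-process queue stands in for a real message queue, and the function names are hypothetical. Failure handling is deliberately left out.

```python
import queue
from datetime import datetime, timedelta
from typing import Any, Callable, Dict, List


def dispatch_due(tasks: Dict[str, dict], work_queue: "queue.Queue[str]",
                 now: datetime) -> List[str]:
    """Steps 1-3: find due tasks, claim them, push them to the queue."""
    claimed = []
    for task in tasks.values():
        if task["status"] == "SCHEDULED" and task["next_run_at"] <= now:
            task["status"] = "IN_FLIGHT"  # claim; trivially atomic in one process
            work_queue.put(task["task_id"])
            claimed.append(task["task_id"])
    return claimed


def run_worker_once(tasks: Dict[str, dict], work_queue: "queue.Queue[str]",
                    handler: Callable[[Any], None], now: datetime) -> str:
    """Steps 4-5: execute one queued task and record the result."""
    task_id = work_queue.get_nowait()
    task = tasks[task_id]
    handler(task["payload"])
    task["last_run"] = now
    if task["interval"] is not None:
        # Recurring: schedule the next run relative to the planned time.
        task["next_run_at"] = task["next_run_at"] + task["interval"]
        task["status"] = "SCHEDULED"
    else:
        task["status"] = "DONE"  # assumed terminal status for one-off tasks
    return task_id
```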


Write path - task creation

  1. User creates task via UI/API
  2. Scheduler validates schedule and stores task
  3. Task becomes eligible for dispatch once next_run_at is due (a time index on next_run_at keeps the dispatcher's lookup cheap)


Failure scenario - worker failure

  • Task execution fails
  • Attempts incremented
  • Task rescheduled with backoff
  • At-least-once guarantee preserved
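
A minimal sketch of the reschedule-with-backoff step, assuming exponential backoff with a cap. The base delay, the cap, and the maximum number of attempts are illustrative values, and giving up after MAX_ATTEMPTS is an assumption that these notes do not specify.

```python
from datetime import datetime, timedelta

BASE_DELAY_S = 5    # assumed first retry delay
MAX_DELAY_S = 300   # assumed cap on the delay
MAX_ATTEMPTS = 5    # assumed limit before the task is marked FAILED


def backoff_delay(attempts: int) -> timedelta:
    """Exponential backoff: base * 2^(attempts - 1), capped at MAX_DELAY_S."""
    seconds = min(BASE_DELAY_S * 2 ** (attempts - 1), MAX_DELAY_S)
    return timedelta(seconds=seconds)


def on_failure(task: dict, now: datetime) -> dict:
    """Record a failed attempt and either reschedule or give up."""
    task["attempts"] += 1
    if task["attempts"] >= MAX_ATTEMPTS:
        task["status"] = "FAILED"
    else:
        task["status"] = "SCHEDULED"
        task["next_run_at"] = now + backoff_delay(task["attempts"])
    return task
```

In practice some random jitter is often added to the delay so that many failed tasks do not all retry at the same instant.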


Capacity estimate

  • Assume ~5k tasks/minute
  • ~10M tasks stored → ~10–20GB DB storage
  • Dispatcher scales horizontally by sharding on time buckets
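
The arithmetic behind these figures, as a quick back-of-the-envelope check. The 1-2 KB row size is what the 10-20GB figure implies for 10M rows; the 3-second average runtime is an assumption standing in for "a few seconds".

```python
TASKS_PER_MINUTE = 5_000
STORED_TASKS = 10_000_000
ROW_BYTES_LOW, ROW_BYTES_HIGH = 1_000, 2_000  # implied by 10-20 GB over 10M rows
AVG_RUNTIME_S = 3                             # assumption: "a few seconds"

# Average dispatch rate the pipeline must sustain.
tasks_per_second = TASKS_PER_MINUTE / 60

# Storage range in decimal gigabytes.
storage_gb = (
    STORED_TASKS * ROW_BYTES_LOW / 1e9,
    STORED_TASKS * ROW_BYTES_HIGH / 1e9,
)

# Little's law: average tasks in flight = arrival rate * average runtime.
avg_in_flight = tasks_per_second * AVG_RUNTIME_S

print(f"{tasks_per_second:.1f} tasks/s, "
      f"{storage_gb[0]:.0f}-{storage_gb[1]:.0f} GB, "
      f"~{avg_in_flight:.0f} tasks in flight on average")
```

This is a steady-state average; cron schedules tend to cluster at the top of the minute or hour, so peak concurrency can be well above it.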


Trade-offs

  • At-least-once execution simplifies reliability but may cause duplicate executions
  • Polling-based dispatch is simpler than push-based timers but slightly less precise


Future improvements

  • User notifications (email/webhooks)
  • Per-tenant rate limits
  • Multi-region schedulers



Detailed Component Design

Deep Dive 1 — Task claiming & concurrency


Problem

Multiple dispatcher instances may pick the same due task, causing duplicate execution.


Solution

Dispatchers atomically claim tasks using a state transition (SCHEDULED → IN_FLIGHT) with a lease (locked_until).

Only successfully claimed tasks are enqueued for execution.
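
A minimal sketch of the claim step using SQLite from the standard library. The table layout, the 30-second lease, and the rule that an expired lease makes a task claimable again are assumptions. The key idea is that each claim is a conditional UPDATE: if another dispatcher got there first, the WHERE clause no longer matches and the row count is 0.

```python
import sqlite3
from typing import List

LEASE_SECONDS = 30  # assumed lease length


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE tasks ("
        " task_id TEXT PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " next_run_at INTEGER NOT NULL,"  # epoch seconds
        " locked_until INTEGER)"
    )


def claim_due_tasks(conn: sqlite3.Connection, now: int, limit: int = 100) -> List[str]:
    """Return the ids this dispatcher claimed (SCHEDULED -> IN_FLIGHT with a lease)."""
    candidates = conn.execute(
        "SELECT task_id FROM tasks"
        " WHERE (status = 'SCHEDULED' AND next_run_at <= ?)"
        "    OR (status = 'IN_FLIGHT' AND locked_until <= ?)"
        " ORDER BY next_run_at LIMIT ?",
        (now, now, limit),
    ).fetchall()

    claimed = []
    for (task_id,) in candidates:
        # Re-check the condition inside the UPDATE so the claim is atomic.
        cur = conn.execute(
            "UPDATE tasks SET status = 'IN_FLIGHT', locked_until = ?"
            " WHERE task_id = ? AND ("
            "  (status = 'SCHEDULED' AND next_run_at <= ?)"
            "  OR (status = 'IN_FLIGHT' AND locked_until <= ?))",
            (now + LEASE_SECONDS, task_id, now, now),
        )
        if cur.rowcount == 1:  # 0 means another dispatcher claimed it first
            claimed.append(task_id)
    conn.commit()
    return claimed
```

On a database such as PostgreSQL, a similar effect is often achieved with SELECT ... FOR UPDATE SKIP LOCKED, which lets concurrent dispatchers skip rows that another transaction has already locked.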


Trade-off

  • Cost: slightly higher DB contention
  • Benefit: strong correctness guarantees (no double dispatch while a lease is held)


Deep Dive 2 — Recurring task scheduling

Problem

Recurring tasks must be rescheduled reliably without drifting or missing executions.


Solution

After execution, compute next_run_at based on the cron expression.

Persist the next execution time atomically with the execution result.
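
A minimal sketch of computing the next run time from a basic cron expression, using only the standard library. It supports just "*", "*/n", single numbers, and comma lists, and it requires all five fields to match, which is simpler than standard cron's day-of-month/day-of-week rule. The function names are hypothetical; a production system would more likely use an established cron library.

```python
from datetime import datetime, timedelta
from typing import Set

# Valid ranges: minute, hour, day-of-month, month, day-of-week (0 = Sunday).
_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def _parse_field(text: str, lo: int, hi: int) -> Set[int]:
    values: Set[int] = set()
    for part in text.split(","):
        if part == "*":
            values.update(range(lo, hi + 1))
        elif part.startswith("*/"):
            values.update(range(lo, hi + 1, int(part[2:])))
        else:
            value = int(part)
            if not lo <= value <= hi:
                raise ValueError(f"{value} outside {lo}-{hi}")
            values.add(value)
    return values


def compute_next_run(cron: str, after: datetime) -> datetime:
    """Return the first minute strictly after `after` that matches `cron`."""
    fields = cron.split()
    if len(fields) != 5:
        raise ValueError("expected 5 cron fields")
    minute, hour, dom, month, dow = (
        _parse_field(text, lo, hi) for text, (lo, hi) in zip(fields, _RANGES)
    )
    candidate = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    for _ in range(366 * 24 * 60):  # search at most about one year ahead
        cron_dow = (candidate.weekday() + 1) % 7  # Python: Monday=0; cron: Sunday=0
        if (candidate.minute in minute and candidate.hour in hour
                and candidate.day in dom and candidate.month in month
                and cron_dow in dow):
            return candidate
        candidate += timedelta(minutes=1)
    raise ValueError("no matching time within a year")
```

Passing the previous scheduled time (rather than the completion time) as `after` keeps runs aligned to the cron slots, so execution delays do not accumulate as drift. If a task has fallen several slots behind, whether to catch up or skip ahead to the next future slot is a separate policy decision.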


Trade-off

  • Cost: more scheduler logic
  • Benefit: simple, predictable recurring behaviour with durability