# designed with ChatGPT help as practice
Scope and direction
Problem statement
A system that:
- lets users schedule (and manage) tasks for execution through a UI
- runs each task either once or on a recurring basis using cron-style schedules (no complex calendars)
- runs short-lived tasks asynchronously
Clarifications
- what are these tasks? where are they loaded from?
  - user-defined jobs created via an API
  - either a webhook call or an internal job reference
- what scheduling do we need to support for recurring tasks?
  - basic cron style, no complex calendars
- how do we want users to trigger and manage them?
  - users create, update and cancel tasks
- do we need to notify the users about success/failure?
  - no, the users can check via the UI
- how long do these tasks run for? what resources do they need? should they be configurable?
  - a few seconds
  - fixed resources
- I assume execution is asynchronous only?
  - yes
Core constraints
- high scalability -> execute thousands of tasks concurrently
- timely execution -> minimal delay
- high reliability -> tasks must not be lost under failures (at-least-once run)
Requirements
Functional Requirements:
- users create and manage tasks via UI
- the tasks are either webhook calls or internal job references
- tasks can be scheduled for a single ad hoc run or on a simple cron-style schedule
- jobs run asynchronously via workers
Non-Functional Requirements:
- high reliability, so once scheduled a job runs at least once
- medium scale (thousands of tasks per minute)
- low scheduling latency (best effort)
The hardest non-functional constraint is guaranteeing at-least-once execution under failures.
API Design
Caller: UI + internal services
A minimal API surface for now:
POST /tasks
{schedule, payload} -> {task_id}
PUT /tasks/{task_id}
{schedule, payload}
DELETE /tasks/{task_id}
GET /tasks/{task_id}
{status, last_run, next_run}
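To make the request and response shapes concrete, here is a minimal in-memory sketch of the four endpoints. The class name `TaskApi`, the method names, the HTTP status codes and the initial `SCHEDULED` status are assumptions for illustration; the notes only fix the paths and the field names.

```python
import uuid


class TaskApi:
    """In-memory stand-in for the endpoints; each method returns (http_status, body)."""

    def __init__(self):
        self._tasks = {}

    def post_task(self, body):
        # POST /tasks: {schedule, payload} -> {task_id}
        if "schedule" not in body or "payload" not in body:
            return 400, {"error": "schedule and payload are required"}
        task_id = str(uuid.uuid4())
        self._tasks[task_id] = {
            "schedule": body["schedule"],
            "payload": body["payload"],
            "status": "SCHEDULED",  # assumed initial status
            "last_run": None,
            "next_run": None,
        }
        return 201, {"task_id": task_id}

    def put_task(self, task_id, body):
        # PUT /tasks/{task_id}: replace schedule and payload
        task = self._tasks.get(task_id)
        if task is None:
            return 404, {"error": "task not found"}
        task["schedule"] = body["schedule"]
        task["payload"] = body["payload"]
        return 200, {"task_id": task_id}

    def delete_task(self, task_id):
        # DELETE /tasks/{task_id}: cancel the task
        if self._tasks.pop(task_id, None) is None:
            return 404, {"error": "task not found"}
        return 204, None

    def get_task(self, task_id):
        # GET /tasks/{task_id} -> {status, last_run, next_run}
        task = self._tasks.get(task_id)
        if task is None:
            return 404, {"error": "task not found"}
        return 200, {k: task[k] for k in ("status", "last_run", "next_run")}
```

A real service would put these behind an HTTP framework and the Task DB; the sketch only pins down what goes in and what comes out.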
High-Level Design
Components
- Frontend
- API Gateway
- Scheduler Service (stateless)
- Task DB
- Dispatcher
- Queue
- Worker Pool
As shown in the high-level diagram, the UI and API clients interact with the Scheduler Service, which persists tasks; the Dispatcher then hands due tasks to workers via a queue.
Data model
Task {
task_id,
schedule,
payload,
status,
next_run_at,
attempts,
created_at
}
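One possible typed version of this record, as a sketch. Only `SCHEDULED` and `IN_FLIGHT` appear in these notes; the other status values, the field types and the `is_due` helper are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class TaskStatus(str, Enum):
    SCHEDULED = "SCHEDULED"
    IN_FLIGHT = "IN_FLIGHT"
    SUCCEEDED = "SUCCEEDED"  # assumed
    FAILED = "FAILED"        # assumed
    CANCELLED = "CANCELLED"  # assumed


def _utc_now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Task:
    task_id: str
    schedule: str                      # cron expression (assumed string form)
    payload: dict                      # webhook call or internal job reference
    status: TaskStatus = TaskStatus.SCHEDULED
    next_run_at: Optional[datetime] = None
    attempts: int = 0
    created_at: datetime = field(default_factory=_utc_now)

    def is_due(self, now: datetime) -> bool:
        """True when the dispatcher should consider this task for claiming."""
        return (
            self.status is TaskStatus.SCHEDULED
            and self.next_run_at is not None
            and self.next_run_at <= now
        )
```

An index on (status, next_run_at) would be the natural companion to `is_due` in a real table, since the dispatcher filters on exactly those two columns.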
Read/execution path
- Dispatcher queries tasks where next_run_at <= now
- Tasks are atomically claimed
- Claimed tasks are pushed to the queue
- Workers execute tasks asynchronously
- The execution result updates the task status and, for recurring tasks, sets the next run time
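The execution path above can be sketched as a single-process simulation. Everything here is illustrative: tasks are plain dicts, a `deque` stands in for the queue, an `interval` field stands in for evaluating a cron expression, and the `DONE`/`FAILED` statuses are assumed names. The claim is a plain assignment, which is only safe because the sketch has one dispatcher.

```python
from collections import deque
from datetime import datetime, timedelta, timezone


def dispatch_tick(tasks, queue, now):
    """Dispatcher: find due tasks, claim them, push their ids to the queue."""
    claimed = []
    for task in tasks.values():
        if task["status"] == "SCHEDULED" and task["next_run_at"] <= now:
            task["status"] = "IN_FLIGHT"  # claim; a real DB must do this atomically
            queue.append(task["task_id"])
            claimed.append(task["task_id"])
    return claimed


def worker_step(tasks, queue, handler):
    """Worker: run one queued task, then record the result and the next run."""
    task = tasks[queue.popleft()]
    succeeded = handler(task["payload"])
    if succeeded and task["interval"] is not None:
        # Recurring task: advance from the scheduled time, not the finish time.
        task["status"] = "SCHEDULED"
        task["next_run_at"] = task["next_run_at"] + task["interval"]
    elif succeeded:
        task["status"] = "DONE"
    else:
        task["status"] = "FAILED"  # retry and backoff are left out of this sketch
    return task


# Example: one recurring task that is due, one one-off task that is not.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
tasks = {
    "a": {"task_id": "a", "payload": {"n": 1}, "status": "SCHEDULED",
          "next_run_at": now - timedelta(seconds=5), "interval": timedelta(minutes=1)},
    "b": {"task_id": "b", "payload": {"n": 2}, "status": "SCHEDULED",
          "next_run_at": now + timedelta(hours=1), "interval": None},
}
queue = deque()
first = dispatch_tick(tasks, queue, now)   # claims only task "a"
second = dispatch_tick(tasks, queue, now)  # nothing left to claim
result = worker_step(tasks, queue, lambda payload: True)
```

Because a claimed task leaves the `SCHEDULED` state, a second tick at the same instant finds nothing to enqueue, which is the property the dispatcher relies on.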
Write path - task creation
- User creates task via UI/API
- Scheduler validates schedule and stores task
- Task becomes eligible for dispatch via time index
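The "validates schedule" step could look roughly like the check below. It is a sketch that assumes the classic five-field cron layout (minute, hour, day of month, month, day of week) and accepts only `*`, single numbers, `a-b` ranges, comma lists and `/n` steps; names such as `validate_cron` are hypothetical.

```python
import re
from typing import List

# (field name, lowest allowed value, highest allowed value)
CRON_FIELDS = [
    ("minute", 0, 59),
    ("hour", 0, 23),
    ("day_of_month", 1, 31),
    ("month", 1, 12),
    ("day_of_week", 0, 6),
]

# One list item: "*" or "a" or "a-b", optionally followed by "/step".
ITEM = re.compile(r"(\*|\d+(?:-\d+)?)(?:/(\d+))?", re.ASCII)


def validate_cron(expr: str) -> List[str]:
    """Return a list of problems; an empty list means the expression is accepted."""
    parts = expr.split()
    if len(parts) != 5:
        return [f"expected 5 fields, got {len(parts)}"]
    errors = []
    for part, (name, lo, hi) in zip(parts, CRON_FIELDS):
        for item in part.split(","):
            match = ITEM.fullmatch(item)
            if match is None:
                errors.append(f"{name}: cannot parse {item!r}")
                continue
            base, step = match.group(1), match.group(2)
            if step is not None and int(step) == 0:
                errors.append(f"{name}: step must be at least 1 in {item!r}")
            if base != "*":
                bounds = [int(n) for n in base.split("-")]
                if bounds[0] < lo or bounds[-1] > hi or bounds[0] > bounds[-1]:
                    errors.append(f"{name}: {item!r} is out of range {lo}-{hi} or reversed")
    return errors
```

Rejecting a bad schedule at write time keeps unparseable expressions out of the Task DB, so the dispatcher never has to handle them.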
Failure scenario - worker failure
- Task execution fails
- Attempts incremented
- Task rescheduled with backoff
- At-least-once guarantee preserved: the task stays in the Task DB and remains eligible for dispatch until a run succeeds
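The retry steps above can be sketched with exponential backoff. The base delay, the cap and the doubling policy are assumptions; the notes only say "backoff".

```python
from datetime import datetime, timedelta, timezone

BASE_DELAY_SECONDS = 5   # assumed delay after the first failure
MAX_DELAY_SECONDS = 300  # assumed cap so retries never wait more than 5 minutes


def backoff_delay(attempts: int) -> timedelta:
    """Delay before the next try: 5s, 10s, 20s, ... capped at MAX_DELAY_SECONDS."""
    seconds = BASE_DELAY_SECONDS * 2 ** (attempts - 1)
    return timedelta(seconds=min(seconds, MAX_DELAY_SECONDS))


def on_failure(task: dict, now: datetime) -> dict:
    """Record a failed run and make the task eligible again after the backoff."""
    task["attempts"] += 1
    task["status"] = "SCHEDULED"
    task["next_run_at"] = now + backoff_delay(task["attempts"])
    return task


# Example: a task failing for the first time at noon UTC.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
failed_once = on_failure({"task_id": "t1", "attempts": 0, "status": "IN_FLIGHT",
                          "next_run_at": now}, now)
```

Adding random jitter to the delay is a common refinement when many tasks fail at once against the same downstream service; it is omitted here to keep the arithmetic easy to follow.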
Capacity estimate
- Assume ~5k tasks/minute
- ~10M tasks stored → ~10–20GB DB storage (assuming ~1–2KB per task row)
- Dispatcher scales horizontally by sharding on time buckets
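The arithmetic behind these figures, as a quick check. The row size of 1 to 2 KB and the 3-second task duration are assumptions (the notes say only "a few seconds"); the average concurrency uses Little's law, arrival rate times duration.

```python
TASKS_PER_MINUTE = 5_000
STORED_TASKS = 10_000_000
ROW_BYTES_LOW = 1_000    # assumed lower bound per task row
ROW_BYTES_HIGH = 2_000   # assumed upper bound per task row
TASK_DURATION_SECONDS = 3  # assumed stand-in for "a few seconds"

# Throughput the dispatcher and queue must sustain.
tasks_per_second = TASKS_PER_MINUTE / 60

# Storage range in decimal gigabytes.
storage_gb_low = STORED_TASKS * ROW_BYTES_LOW / 1e9
storage_gb_high = STORED_TASKS * ROW_BYTES_HIGH / 1e9

# Little's law: average number of tasks executing at the same time.
avg_concurrent_tasks = TASKS_PER_MINUTE * TASK_DURATION_SECONDS / 60

print(f"{tasks_per_second:.1f} tasks/s")
print(f"{storage_gb_low:.0f}-{storage_gb_high:.0f} GB")
print(f"{avg_concurrent_tasks:.0f} tasks running on average")
```

Under these assumptions the steady-state load is roughly 83 tasks per second and about 250 tasks in flight on average. Cron schedules tend to cluster on minute and hour boundaries, so peak concurrency can be well above that average.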
Trade-offs
- At-least-once execution simplifies reliability but may cause duplicate executions
- Polling-based dispatch is simpler than push-based timers but slightly less precise
Future improvements
- User notifications (email/webhooks)
- Per-tenant rate limits
- Multi-region schedulers
Detailed Component Design
Deep Dive 1 — Task claiming & concurrency
Problem
Multiple dispatcher instances may pick the same due task, causing duplicate execution.
Solution
Dispatchers atomically claim tasks using a state transition (SCHEDULED → IN_FLIGHT) with a lease (locked_until).
Only successfully claimed tasks are enqueued for execution.
Trade-off
- cost: slightly higher DB contention
- benefit: strong correctness guarantees (prevents double dispatch)
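A minimal sketch of the claim step, using SQLite from the standard library so it runs anywhere. The table layout, integer epoch-second timestamps, the 30-second lease and the rule that an expired lease makes a task claimable again are all assumptions. The key idea is that the state check and the state change happen in one conditional UPDATE, so only one dispatcher sees a row count of 1.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory Task DB with just the columns the claim needs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks ("
        " task_id TEXT PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " next_run_at INTEGER NOT NULL,"  # epoch seconds
        " locked_until INTEGER)"          # lease expiry, epoch seconds
    )
    return conn


def claim_task(conn: sqlite3.Connection, task_id: str, now: int,
               lease_seconds: int = 30) -> bool:
    """Return True only for the caller whose UPDATE moved the task to IN_FLIGHT."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE tasks SET status = 'IN_FLIGHT', locked_until = ? "
            "WHERE task_id = ? AND next_run_at <= ? "
            "AND (status = 'SCHEDULED' "
            "     OR (status = 'IN_FLIGHT' AND locked_until <= ?))",
            (now + lease_seconds, task_id, now, now),
        )
    return cur.rowcount == 1


# Example: two dispatchers race for the same due task.
conn = make_db()
conn.execute("INSERT INTO tasks VALUES ('t1', 'SCHEDULED', 100, NULL)")
first_claim = claim_task(conn, "t1", now=100)   # wins the claim
second_claim = claim_task(conn, "t1", now=101)  # loses: already IN_FLIGHT, lease live
after_expiry = claim_task(conn, "t1", now=131)  # lease (until 130) has expired
```

Letting an expired lease be re-claimed is what recovers tasks whose dispatcher or worker died mid-flight, and it is also the reason duplicate executions remain possible under at-least-once semantics.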
Deep Dive 2 — Recurring task scheduling
Problem
Recurring tasks must be rescheduled reliably without drifting or missing executions.
Solution
After execution, compute next_run_at based on the cron expression.
Persist the next execution time atomically with the execution result.
Trade-off
- cost: more scheduler logic
- benefit: simple, predictable recurring behaviour with durability
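Computing next_run_at from a cron expression can be sketched with a brute-force, minute-by-minute search. This is a simplified illustration, not a full cron implementation: it assumes five fields with `*`, numbers, `a-b` ranges, comma lists and `/n` steps, requires all five fields to match (standard cron treats day-of-month and day-of-week as alternatives when both are restricted), and gives up after one year. A production system would more likely use an established cron library.

```python
from datetime import datetime, timedelta


def _field_matches(field: str, value: int, lowest: int) -> bool:
    """Check one cron field (e.g. '*/15' or '1-5,7') against a concrete value."""
    for item in field.split(","):
        base, _, step_text = item.partition("/")
        step = int(step_text) if step_text else 1
        if base == "*":
            start, end = lowest, value
        elif "-" in base:
            start, end = (int(x) for x in base.split("-"))
        else:
            start = end = int(base)
        if start <= value <= end and (value - start) % step == 0:
            return True
    return False


def next_run_at(cron: str, after: datetime) -> datetime:
    """Return the first minute strictly after 'after' that matches 'cron'."""
    minute, hour, dom, month, dow = cron.split()
    candidate = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    for _ in range(366 * 24 * 60):  # search at most one year ahead
        cron_dow = (candidate.weekday() + 1) % 7  # cron: 0 = Sunday; Python: 0 = Monday
        if (
            _field_matches(minute, candidate.minute, 0)
            and _field_matches(hour, candidate.hour, 0)
            and _field_matches(dom, candidate.day, 1)
            and _field_matches(month, candidate.month, 1)
            and _field_matches(dow, cron_dow, 0)
        ):
            return candidate
        candidate += timedelta(minutes=1)
    raise ValueError(f"no run time within a year for {cron!r}")


# Example: a task on a 15-minute schedule that last ran at 10:07.
upcoming = next_run_at("*/15 * * * *", datetime(2024, 1, 1, 10, 7))
```

Passing the previous scheduled time as `after`, rather than the time the run finished, is one way to keep the schedule from drifting when executions start late or run long.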