Design A Task Scheduler - System Design

System requirements

Functional:

Users can create tasks to be executed at a specified time.
Users can read, update, and delete those tasks.
Users can see the execution status of each task.

Non-Functional:

Reliability: a failed task can be re-executed.
Consistency: eventual consistency is acceptable.

Capacity estimation

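Assume 1M daily active users (DAU), each making 10 writes and 100 reads per day.

That is 10M writes and 100M reads per day, or roughly 100 write QPS and 1,000 read QPS on average. Assuming a 10x peak factor, peak load is about 1k write QPS and 10k read QPS.

Assume each task is at most 1 MB. In the worst case that is 10M x 1 MB = 10 TB per day, or about 3.6 PB of data per year.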
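
The arithmetic behind these estimates can be checked with a short sketch. It assumes 1 MB = 10^6 bytes and treats every write as a full-size task, so the storage figure is an upper bound rather than an expected value. The function name `estimate` is illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def estimate(dau, writes_per_user, reads_per_user, max_task_bytes):
    """Return (write_qps, read_qps, bytes_per_year) for an average day."""
    writes_per_day = dau * writes_per_user
    reads_per_day = dau * reads_per_user
    write_qps = writes_per_day / SECONDS_PER_DAY
    read_qps = reads_per_day / SECONDS_PER_DAY
    # Worst case: every write stores a task of the maximum size.
    bytes_per_year = writes_per_day * max_task_bytes * 365
    return write_qps, read_qps, bytes_per_year


write_qps, read_qps, bytes_per_year = estimate(
    dau=1_000_000, writes_per_user=10, reads_per_user=100, max_task_bytes=10**6
)
print(round(write_qps), round(read_qps), bytes_per_year / 10**15)
```

The exact averages come out slightly above the rounded 100 and 1,000 QPS figures, which is fine for a back-of-the-envelope estimate.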

API design

createTask(user_id, scheduled_time, task_content) -> task_id
getTask(task_id) -> status, result
updateTask(task_id, scheduled_time, task_content) -> status
deleteTask(task_id) -> status
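
As a rough illustration of these four calls, here is an in-memory sketch. The class name `InMemoryTaskApi`, the status strings ("new", "ok", "not_found"), and the dict-based storage are all assumptions made for the example; a real service would sit behind the API gateway and write to the database.

```python
import uuid


class InMemoryTaskApi:
    """Toy stand-in for the task management service (hypothetical)."""

    def __init__(self):
        self._tasks = {}  # task_id -> task record

    def create_task(self, user_id, scheduled_time, task_content):
        task_id = str(uuid.uuid4())
        self._tasks[task_id] = {
            "user_id": user_id,
            "scheduled_time": scheduled_time,
            "content": task_content,
            "status": "new",   # assumed initial status
            "result": None,
        }
        return task_id

    def get_task(self, task_id):
        task = self._tasks.get(task_id)
        if task is None:
            return "not_found", None
        return task["status"], task["result"]

    def update_task(self, task_id, scheduled_time, task_content):
        task = self._tasks.get(task_id)
        if task is None:
            return "not_found"
        task["scheduled_time"] = scheduled_time
        task["content"] = task_content
        return "ok"

    def delete_task(self, task_id):
        removed = self._tasks.pop(task_id, None)
        return "ok" if removed is not None else "not_found"
```

Returning a status string instead of raising keeps the sketch close to the `-> status` shape of the signatures above.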

Database design

task(task_id, user_id, create_time, scheduled_time, status, content)

execution(execution_id, task_id, execution_status, result)

Relationship: one task has many executions (1 to n).
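
The two records and their one-to-many relationship might be modeled as below. The field types and the set of status values are assumptions (the design only names the fields and mentions new, failed, and scheduled states); `executions_for` is a hypothetical helper showing how one task maps to many executions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Status(str, Enum):
    # Assumed status values; the design does not enumerate them.
    NEW = "new"
    SCHEDULED = "scheduled"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Task:
    task_id: str
    user_id: str
    create_time: float      # epoch seconds (assumed representation)
    scheduled_time: float   # epoch seconds (assumed representation)
    status: Status
    content: str


@dataclass
class Execution:
    execution_id: str
    task_id: str            # references Task.task_id
    execution_status: Status
    result: Optional[str] = None


def executions_for(task_id: str, executions: List[Execution]) -> List[Execution]:
    """Return every execution that belongs to the given task."""
    return [e for e in executions if e.task_id == task_id]
```

A retried task keeps a single `Task` row and gains one `Execution` row per attempt.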

High-level design

The whole system contains several parts:

The load balancer and API gateway are responsible for rate limiting, authentication (authn), authorization (authz), etc.
The task management service handles CRUD operations on tasks and writes them to the database.
The task scheduler service is responsible for:
- finding new or failed tasks by polling the DB
- updating the task in the DB with status=scheduled
- creating a new execution in the execution table
- pushing the execution onto the message queue
The task executor service pulls executions from the queue and assigns them to its workers. When an execution finishes, it writes the result back to the database.
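
The scheduler and executor responsibilities above can be sketched with in-memory stand-ins. Here plain dicts play the role of the database tables and `queue.Queue` plays the role of the message queue. The function names, the "succeeded" status, and the check that a task is due (`scheduled_time <= now`) are assumptions added for the example.

```python
import queue
import uuid


def scheduler_tick(tasks, executions, mq, now):
    """One polling pass: enqueue each due task whose status is new or failed."""
    for task in tasks.values():
        if task["status"] in ("new", "failed") and task["scheduled_time"] <= now:
            task["status"] = "scheduled"
            execution_id = str(uuid.uuid4())
            executions[execution_id] = {
                "task_id": task["task_id"],
                "execution_status": "scheduled",
                "result": None,
            }
            mq.put(execution_id)


def executor_drain(tasks, executions, mq, run):
    """Pull executions until the queue is empty, recording each outcome."""
    while True:
        try:
            execution_id = mq.get_nowait()
        except queue.Empty:
            return
        execution = executions[execution_id]
        task = tasks[execution["task_id"]]
        try:
            execution["result"] = run(task["content"])
            outcome = "succeeded"
        except Exception as exc:  # a failed run is retried on a later tick
            execution["result"] = repr(exc)
            outcome = "failed"
        execution["execution_status"] = outcome
        task["status"] = outcome
```

Because a failed run sets the task status back to failed, the next `scheduler_tick` creates a fresh execution for it, which is one simple way to get the re-execution behavior listed in the requirements. A real system would likely cap the number of retries.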

Request flows

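A request flows through the system as follows:
- The client calls createTask. The request passes through the load balancer and API gateway to the task management service, which writes the task to the database and returns the task_id.
- The task scheduler service polls the database, finds the task, sets status=scheduled, creates an execution, and pushes it onto the message queue.
- The task executor service pulls the execution from the queue, assigns it to a worker, and writes the result to the database when the worker finishes.
- The client calls getTask to read the status and result.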

Detailed component design

DB choice: we will use a NoSQL store. The access pattern is simple (lookups by task_id and queries by status), eventual consistency is acceptable, and a NoSQL store scales horizontally for the write volume estimated above, typically at lower cost.
The task scheduler service repeatedly queries by status, so we should build an index on status.
The task executor service and the task scheduler service each run multiple workers (to poll the DB or to run tasks). A coordination service can monitor their health via heartbeats.
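
Heartbeat-based health monitoring can be sketched as follows. The class name `HeartbeatMonitor` and the timeout value are assumptions; a real deployment would more likely rely on an existing coordination service than on hand-written code like this.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per worker and reports silent ones."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self._last_seen = {}  # worker_id -> time of last heartbeat

    def beat(self, worker_id, now):
        """Record a heartbeat from a worker at time `now`."""
        self._last_seen[worker_id] = now

    def dead_workers(self, now):
        """Return workers whose last heartbeat is older than the timeout."""
        return sorted(
            worker_id
            for worker_id, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

Passing `now` in explicitly, instead of reading the clock inside the class, keeps the sketch deterministic and easy to test.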

Trade offs/Tech choices

Instead of a message queue, the task scheduler service could write to a being-scheduled table in the DB and let the task executor service poll that table directly. This removes one component to operate, at the cost of extra polling load on the database.
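
If executors poll a table directly, two workers must not pick up the same row. One common way to handle this is a conditional update that only succeeds while the row is still pending. The sketch below uses `sqlite3` purely as a stand-in so that it runs anywhere; the table layout, the status values, and the `claim_next` helper are assumptions, and the chosen store would need some equivalent form of conditional write.

```python
import sqlite3


def claim_next(conn, worker_id):
    """Try to claim one pending row; return its execution_id or None."""
    row = conn.execute(
        "SELECT execution_id FROM being_scheduled "
        "WHERE status = 'pending' ORDER BY execution_id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # The status check in WHERE makes the claim fail if another
    # worker changed the row between our SELECT and this UPDATE.
    cur = conn.execute(
        "UPDATE being_scheduled SET status = 'running', worker_id = ? "
        "WHERE execution_id = ? AND status = 'pending'",
        (worker_id, row[0]),
    )
    conn.commit()
    return row[0] if cur.rowcount == 1 else None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE being_scheduled ("
    "execution_id TEXT PRIMARY KEY, status TEXT, worker_id TEXT)"
)
conn.executemany(
    "INSERT INTO being_scheduled VALUES (?, 'pending', NULL)",
    [("e1",), ("e2",)],
)
conn.commit()
```

A worker that gets `None` back simply polls again later, which is the extra database load this alternative trades for not running a queue.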

Failure scenarios/bottlenecks

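- A worker in the task executor service crashes mid-execution: the coordination service notices the missed heartbeats, and the task can then be marked failed so that the task scheduler service picks it up again on its next poll.
- The task scheduler service polls the database continuously, so the status query can become a bottleneck as the task table grows. The index on status helps keep that query cheap.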

Future improvements

With higher traffic, we need to:

- shard the database
- scale up the number of workers in the task scheduler service
- scale up the number of workers in the task executor service
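
One simple way to shard is to hash the task_id and take the result modulo the shard count. This is a sketch under assumptions: the helper name `shard_for`, the choice of task_id as the shard key, and the use of SHA-256 are all illustrative.

```python
import hashlib


def shard_for(task_id, num_shards):
    """Map a task_id to a shard index in [0, num_shards)."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce.
    return int.from_bytes(digest[:8], "big") % num_shards
```

With task_id as the shard key, lookups by task_id hit a single shard, but the scheduler's status query has to run against every shard. Plain modulo also remaps most keys when the shard count changes, so a scheme such as consistent hashing may be worth considering if resharding is expected.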