System requirements
Functional:
- users can create tasks which can be executed in certain time
- users can read/update/delete those tasks
- users should be able to see the task execution status
- tasks can be cron jobs
Non-Functional:
- reliable: the tasks can be re-executed if failed
- Consistency: Job results should be consistent, ensuring that jobs are executed once (or with minimal duplication).
Capacity estimation
Assume we have ADU 1M and 10 write per day and 100 read per day
- 100 write qps and 1000 read qps, for peak, it could be 1k write qps and 10k read qps
Assume each tasks should be less than 1MB, we can have 3.6 PB data each year
API design
- createTask(user_id, scheduled_time, task_content) -> task_id
- getTask(task_id) -> status, result
- updateTask(task_id, sceduled_time, task_content) -> status
- deleteTask(task_id) -> status
Database design
task(task_id
execution(execution_id
# execution_status can be: scheduled_to_run, retriable_failed, non-retriable_failed
task 1 ... n execution
High-level design
The whoe system contains several parts:
- Load balancer and API Gateway is responsible for rate limiting, authn, authz, xxxx etc
- Task management service is responsible to CRUD the tasks and write that into Database
- The task scheduler server is responsible to create executions scheduled at the current time point
- The task executor server is responsible to pull the execution from the queue and assign to its worker. When it is done, update the result into the database
Request flows
The whoe system contains several parts:
- the task scheduler service has multiple workers to poll DB partitions and
- find some tasks need to be run
- create a new execution in the table
- push the execution into the message queue
- Also pull the execution table with execution_status=retriable_failed and push that into message queue
- The task executor server is responsible to pull the execution from the queue and assign to its worker. When it is done, update the result into the database
Detailed component design
- DB choice: we will use relationable DB as we want the task to be executed only once.
- Inside task executor service and task scheduler service, they have multiple workers running(to poll DB or to run tasks), we can have cooridination service to monitor their health by hearbeat
Trade offs/Tech choices
Rather than using message queue, we can also let task scheduler service write to a being-scheduled table in DB and let task executor service directly poll from the database
Failure scenarios/bottlenecks
Try to discuss as many failure scenarios/bottlenecks as possible.
Future improvements
With higher traffic, we need to
- sharding our database and in the meanwhile,
- we need to scale up the workers in task scheduler service
- we need to scale up the workers in task executor service