Requirements
Assumption, the task runner is external to our system, and has unlimited resources to run tasks, for example, it's a serverless function.
Functional Requirements:
List the key functional requirements for the system (Ask the AI for hints if stuck)...
- Schedule tasks to be executed once at a specified time in the future or on a recurring basis
- The user can create or delete a task scheduling.
Non-Functional Requirements:
List the key non-functional requirements (performance, scalability, reliability, etc.)...
- Low latency: the delay between task execution time and its specified time should be minimal.
- High reliability: the system should run as robust as possible. For example, failure recovery is critical for such system.
- Scalability: the system should can handle thousands of task scheduling at the same time. And the core components can scale horizontally.
Capability estimation
For such system, I can think the task creation or delte frequency is low, while the task execution frequence and concurrency is high.
- average tasks creation or update for one day: 100
- average active tasks in the system: 1000
- average task executions in the system for one day: 100000
- average task duration: 10 minutes.
- min task execute interval: 1 minute.
A task representation can be like below:
- user_id: 8 bytes
- task_id: 8bytes
- execute_start: 8 bytes (timestamp)
- execute_interval: 8 bytes (timestamp)
- execution_binary: 1000 bytes
- status: 8 bytes: (ACTIVE, DELETED)
- last_schedule_time: 8 bytes.
1 record consumes just 1KB data, if there are total 10k tasks, then I only need 10MB storage.
A task execution record can be like:
- task_id: 8bytes
- execution_id: 8bytes
- schedule_time: 8bytes
- worker_id: 8 bytes
- state: 8 bytes
- external_run_id: 8 bytes
- actions: list of pair (state, time)
API Design
Define the APIs expected from the system. This is your chance to analyze and define the read and write paths so that you can come up with the high-level design...
For task:
create_task
params:
- user_id
- execute_start
- execute_interval (optional)
- execution_binary
response:
- task_id
delete_task
params:
- task_id
For task execution:
create_task_executions:
params:
- task_id
- until
Response:
- list of execution_id
get_task_execution:
params:
- execution_id
update_task_execution:
- execution_id
- state
Database design
user
- user_id
- user_name
tasks
- task_id
- user_id
- execute_start
- execute_interval
- execute_binary
- status
- last_schedule_time
executions
- task_id
- execution_id
- message_id
- schedule_time
- state
- actions
High-Level Design
Describe the overall system architecture. Identify the main components needed to solve the problem end-to-end. Use the diagramming tool to create a block diagram.
The system contains the following components:
Task service
To create, delete or get task record.
Execution service
To create task execution, or update task execution state.
Scheduler
Based on task's scheduling policy, create the specific executions.
Execution monitor
Monitor the executions to be executed in the near future, and push the message into the queue.
Execution worker
It will trigger the task runner instance(for example AWS serverless function), and then monitor the task runner instance status by event listening or long polling. After the task runner instance finishes, it will update the execution's status.
Message Queue
Decouple the scheduler and execution workers, make them can scale independently.
I can use RabbitMQ here because it supports avdanced mechanism such as message ACK and support high consumer concurrency.
And the following are external components which may not be maintained by us:
Notification workers
Notify the task's owner that the task execution state.
Task runner
The real process to run the task, it may be a k8s job, or some async job system.
Request flows
To clarrify the system's running process, let's summarize what happends for a user submitted scheduling task:
- the user creates a task record in the system with execution_start, execution_interval and the execution_binary.
- the scheduler will call the task service to get all tasks whose last_schedule_time is empty or is less than the `last_schedule_time + execute_interval <= current_time + 1hour`, then the scheduler will call the create_task_executions for every task.
- The execution service will create task executions for tasks.
- When the execution schedule_time is less than current_time + 10 minutes, the execution monitor will create the execution message and push it into the queue.
- The execution worker will fetch the execution message and wait until the schedule time, then call the task runner to execute it.
- Until the task execution is done, the execution worker will update the executions status and send the notification message to the message queue.
- the notification workers will notify the task execution's state.
Detailed Component Design
Deep dive into 2-3 key components. Explain how they work, how they scale, discuss tradeoffs, capacity, and any relevant algorithms or data structures.
To achieve minimal delay in the system, I will apply cache with database for the task service and execution service:
Execution service
Execution service is the possible bottleneck, let me introduce it in detail:
The API `create_executions` with until may be invoked frequently by the Execution monitor.
It will store a task's record inside database, if new executions are created, then the task record's last_schedule_time will be updated, then the cache item is removed. If no new executions need to be created, then the task record will be write into the cache if it doesn't exist.
And the execution's record will also be written into the cache after it's updated, so that the query will be efficient.
Execution monitor
Execution monitor is a stateless process, and it will check to-be-executed executions in the near future periodically. I can deploy multiple instances for high reliability, and each of them works independently. And they use row-level locking in the database to avoid duplicate execution messages. If an execution state is already SCHEDULED, nothing will happen.
Its logic is as follows:
```
// find all to-be-executed executions in the near future
// randomize the executions
// iterate every execution
// lock the execution row
// if the execution state is already SCHEDULED, abort
// create the execution message
// update the execution state
// commit the DB update
```
This is row-level locking, so the execution creation can run with high concurrency.
Scheduler
The scheduler is very similar to the execution monitor for high performance and reliability. It will check the task record row and lock it to call `create_task_executions`.
Execution worker
It's the only stateful component of the system. I must ensure that it can recover from failure gracefully to achieve high reliability.
- Fetch the execution message from the queue
- When an execution process needs to create a corresponding task runner instance, it will firstly lock the execution row in the database.
- read the execution row and check whether the external_run_id exists. If it doesn't exist, then call the task runner to create one instance and return the id.
- Update the execution status and write the external_run_id back into the execution db.
- wait for the task runner instance to be finished by long polling. To avoid high load on the task runner's query API, I can use an exponential backoff mechenism here.
- lock the DB again and update the execution status to FINISHED or FAILURE.
- ACK the message from the queue.
Even the execution worker is stateful, but its state is dumped into the database. If it crashes unexpectedly, the execution message will be automatically redistributed to another execution worker and re-executed, without triggering a new external task runner.
It's evident that the execution worker is light load, and it's very easy to scale the workers horizontally. The high reliability and scalability are achieved. If I find that the average delay between created and start time is increasing, we can scale the replicas of execution worker.
Future improvements
Data auto-expire
The system's execution data is massive, for example 10 0k rows every day, but the execution records will not be used for scheduling after the execution finished. We can move the finished execution data every week or month to the archive storage for lower cost and higher performance of the DB.