Design A Task Scheduler - System Design

Requirements

Assumption, the task runner is external to our system, and has unlimited resources to run tasks, for example, it's a serverless function.

Functional Requirements:

List the key functional requirements for the system (Ask the AI for hints if stuck)...

Schedule tasks to be executed once at a specified time in the future or on a recurring basis
The user can create or delete a task scheduling.

Non-Functional Requirements:

List the key non-functional requirements (performance, scalability, reliability, etc.)...

Low latency: the delay between task execution time and its specified time should be minimal.
High reliability: the system should run as robust as possible. For example, failure recovery is critical for such system.
Scalability: the system should can handle thousands of task scheduling at the same time. And the core components can scale horizontally.

Capability estimation

For such system, I can think the task creation or delte frequency is low, while the task execution frequence and concurrency is high.

average tasks creation or update for one day: 100
average active tasks in the system: 1000
average task executions in the system for one day: 100000
average task duration: 10 minutes.
min task execute interval: 1 minute.

A task representation can be like below:

user_id: 8 bytes
task_id: 8bytes
execute_start: 8 bytes (timestamp)
execute_interval: 8 bytes (timestamp)
execution_binary: 1000 bytes
status: 8 bytes: (ACTIVE, DELETED)
last_schedule_time: 8 bytes.

1 record consumes just 1KB data, if there are total 10k tasks, then I only need 10MB storage.

A task execution record can be like:

task_id: 8bytes
execution_id: 8bytes
schedule_time: 8bytes
worker_id: 8 bytes
state: 8 bytes
external_run_id: 8 bytes
actions: list of pair (state, time)

API Design

Define the APIs expected from the system. This is your chance to analyze and define the read and write paths so that you can come up with the high-level design...

For task:

create_task

params:

user_id
execute_start
execute_interval (optional)
execution_binary

response:

task_id

delete_task

params:

task_id

For task execution:

create_task_executions:

params:

task_id
until

Response:

list of execution_id

get_task_execution:

params:

execution_id

update_task_execution:

execution_id
state

Database design

user

user_id
user_name

tasks

task_id
user_id
execute_start
execute_interval
execute_binary
status
last_schedule_time

executions

task_id
execution_id
message_id
schedule_time
state
actions

High-Level Design

Describe the overall system architecture. Identify the main components needed to solve the problem end-to-end. Use the diagramming tool to create a block diagram.

The system contains the following components:

Task service

To create, delete or get task record.

Execution service

To create task execution, or update task execution state.

Scheduler

Based on task's scheduling policy, create the specific executions.

Execution monitor

Monitor the executions to be executed in the near future, and push the message into the queue.

Execution worker

It will trigger the task runner instance(for example AWS serverless function), and then monitor the task runner instance status by event listening or long polling. After the task runner instance finishes, it will update the execution's status.

Message Queue

Decouple the scheduler and execution workers, make them can scale independently.

I can use RabbitMQ here because it supports avdanced mechanism such as message ACK and support high consumer concurrency.

And the following are external components which may not be maintained by us:

Notification workers

Notify the task's owner that the task execution state.

Task runner

The real process to run the task, it may be a k8s job, or some async job system.

Request flows

To clarrify the system's running process, let's summarize what happends for a user submitted scheduling task:

the user creates a task record in the system with execution_start, execution_interval and the execution_binary.
the scheduler will call the task service to get all tasks whose last_schedule_time is empty or is less than the `last_schedule_time + execute_interval <= current_time + 1hour`, then the scheduler will call the create_task_executions for every task.
The execution service will create task executions for tasks.
When the execution schedule_time is less than current_time + 10 minutes, the execution monitor will create the execution message and push it into the queue.
The execution worker will fetch the execution message and wait until the schedule time, then call the task runner to execute it.
Until the task execution is done, the execution worker will update the executions status and send the notification message to the message queue.
the notification workers will notify the task execution's state.

Detailed Component Design

Deep dive into 2-3 key components. Explain how they work, how they scale, discuss tradeoffs, capacity, and any relevant algorithms or data structures.

To achieve minimal delay in the system, I will apply cache with database for the task service and execution service:

Execution service

Execution service is the possible bottleneck, let me introduce it in detail:

The API `create_executions` with until may be invoked frequently by the Execution monitor.

It will store a task's record inside database, if new executions are created, then the task record's last_schedule_time will be updated, then the cache item is removed. If no new executions need to be created, then the task record will be write into the cache if it doesn't exist.

And the execution's record will also be written into the cache after it's updated, so that the query will be efficient.

Execution monitor

Execution monitor is a stateless process, and it will check to-be-executed executions in the near future periodically. I can deploy multiple instances for high reliability, and each of them works independently. And they use row-level locking in the database to avoid duplicate execution messages. If an execution state is already SCHEDULED, nothing will happen.

Its logic is as follows:

```

// find all to-be-executed executions in the near future

// randomize the executions

// iterate every execution

// lock the execution row

// if the execution state is already SCHEDULED, abort

// create the execution message

// update the execution state

// commit the DB update

```

This is row-level locking, so the execution creation can run with high concurrency.

Scheduler

The scheduler is very similar to the execution monitor for high performance and reliability. It will check the task record row and lock it to call `create_task_executions`.

Execution worker

It's the only stateful component of the system. I must ensure that it can recover from failure gracefully to achieve high reliability.

Fetch the execution message from the queue
When an execution process needs to create a corresponding task runner instance, it will firstly lock the execution row in the database.
read the execution row and check whether the external_run_id exists. If it doesn't exist, then call the task runner to create one instance and return the id.
Update the execution status and write the external_run_id back into the execution db.
wait for the task runner instance to be finished by long polling. To avoid high load on the task runner's query API, I can use an exponential backoff mechenism here.
lock the DB again and update the execution status to FINISHED or FAILURE.
ACK the message from the queue.

Even the execution worker is stateful, but its state is dumped into the database. If it crashes unexpectedly, the execution message will be automatically redistributed to another execution worker and re-executed, without triggering a new external task runner.

It's evident that the execution worker is light load, and it's very easy to scale the workers horizontally. The high reliability and scalability are achieved. If I find that the average delay between created and start time is increasing, we can scale the replicas of execution worker.

Future improvements

Data auto-expire

The system's execution data is massive, for example 10 0k rows every day, but the execution records will not be used for scheduling after the execution finished. We can move the finished execution data every week or month to the archive storage for lower cost and higher performance of the DB.