System requirements


Functional:

  1. schedule tasks to be executed once
  2. schedule tasks on a recurring basis


Non-Functional:

  1. Scale: execute thousands of tasks
  2. Latency: minimal delay
  3. High reliability

Other implicit requirements

  1. Highly available
  2. Security




Capacity estimation

Estimate the scale of the system you are going to design...

  1. API request: 10k QPS, 10k tasks submitted per second
  2. Task Metadata: task_data = [task_name, resource_request, run_script, cron_schedule] = 100kB
  3. Network in: 10k/s * 100kB = 1000MB/s = 8Gbps
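
The arithmetic above can be double-checked with a short sketch. It assumes decimal units (1 MB = 1000 kB, 1 Gb = 1000 Mb); the constant and function names are illustrative only.

```python
# Back-of-the-envelope check of the inbound bandwidth estimate.
# Assumes decimal units: 1 MB = 1000 kB, 1 Gb = 1000 Mb.
TASKS_PER_SECOND = 10_000   # 10k task submissions per second
TASK_SIZE_KB = 100          # size of one task_data payload


def inbound_bandwidth(tasks_per_second, task_size_kb):
    """Return (MB/s, Gbps) for the given submission rate and payload size."""
    mb_per_second = tasks_per_second * task_size_kb / 1000
    gbps = mb_per_second * 8 / 1000
    return mb_per_second, gbps


mb_s, gbps = inbound_bandwidth(TASKS_PER_SECOND, TASK_SIZE_KB)
print(f"{mb_s:.0f} MB/s = {gbps:.0f} Gbps")
```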



API design

Define what APIs are expected from the system...

  1. Submit_task(task_data) -> task_id
  2. Stop_task(task_id)
  3. Create/Get/Update/Delete_task_templates(template_id)
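
As a rough illustration of the first two calls, here is a minimal in-memory sketch. The class name `TaskApi`, the counter-based ID, the "pending" initial state, and the `get_status` helper are all assumptions made for this example; a real API server would persist to the task database and authenticate callers.

```python
import itertools
from typing import Dict


class TaskApi:
    """In-memory stand-in for the API server; no persistence or auth."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)      # placeholder ID source
        self._status: Dict[str, str] = {}   # task_id -> status

    def submit_task(self, task_data: dict) -> str:
        """Register a task and return its task_id."""
        task_id = f"task-{next(self._ids)}"
        self._status[task_id] = "pending"   # assumed initial state
        return task_id

    def stop_task(self, task_id: str) -> bool:
        """Mark a task as terminated; return False for an unknown task_id."""
        if task_id not in self._status:
            return False
        self._status[task_id] = "terminated"
        return True

    def get_status(self, task_id: str) -> str:
        """Hypothetical helper, not part of the API list above."""
        return self._status[task_id]


api = TaskApi()
tid = api.submit_task({"task_name": "unit-tests", "cron_schedule": "0 * * * *"})
api.stop_task(tid)
```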




Database design

Defining the system data model early on will clarify how data will flow among different components of the system. Also you could draw an ER diagram using the diagramming tool to enhance your design...


Task database in NoSQL with multiple tables

  • Task_template_table stores task metadata and should stay small; imagine we have developed thousands of task templates for our CI/CD unit tests.
  • Task_execution_table is easy to split across multiple DB servers, which improves read and write speed.
  • Being schemaless makes it easy to extend a table with more columns as needed.

Task_template_table: {

task_template_id: str,

task_template_name: str,

cpu_request: int,

mem_request: int,

cpu_limit: int,

mem_limit: int,

run_script: str,

cron_schedule: str,

}


Task_execution_table: {

task_id: str,

task_template_id: str,

user_id: str,

node_id: str,

start_time: ctime,

expiry_time: ctime,

status: str,

}

# task_id is unique, e.g. task_template_id + timestamp
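
One possible way to build such an ID is sketched below. The function name `make_task_id` and the random suffix are assumptions of this sketch: the suffix is added because two submissions of the same template could share a timestamp at high submission rates.

```python
import time
import uuid


def make_task_id(task_template_id, now_ns=None):
    """Build a task_id from the template ID and a nanosecond timestamp.

    The random suffix is an addition of this sketch; it keeps IDs distinct
    when two submissions of the same template share a timestamp.
    """
    timestamp = time.time_ns() if now_ns is None else now_ns
    suffix = uuid.uuid4().hex[:8]
    return f"{task_template_id}-{timestamp}-{suffix}"


# Two submissions with an identical timestamp still get different IDs.
a = make_task_id("tmpl42", now_ns=1_700_000_000_000_000_000)
b = make_task_id("tmpl42", now_ns=1_700_000_000_000_000_000)
```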


User_table: {

user_id: str,

user_name: str,

}


Worker_node_info: {

node_id: str,

available_cpus: int,

available_memory: int,

available_disk: int,

}



High-level design

You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design. If you are unfamiliar with the tool, you can simply describe your design to the chat bot and ask it to generate a starter diagram for you to modify...


  1. API server: takes requests from users to manage templates, submit a new task, or stop an existing task.
  2. Task scheduler: maintains worker node cluster info, such as available nodes, and uses it to schedule the jobs users submit.
  3. Worker nodes: take tasks from the scheduler, run them, and report status back to the scheduler and the database.
  4. Database: as described in the Database design section.




Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...

  1. The user sends a request to the API server, such as creating a template or creating a job based on a specific template.
  2. The job scheduler receives the request and either creates the template or gets the available nodes from the node monitor, then starts the task on the node(s) the node monitor returned.
  3. The node monitor tracks the task status and records the resulting status in the database (running/success/fail/terminated).
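
The node lookup in step 2 could be sketched as a first-fit search over the Worker_node_info records. First-fit is an assumption here, as are the function name `pick_node` and the memory unit; the design does not say which placement policy the scheduler uses.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WorkerNode:
    node_id: str
    available_cpus: int
    available_memory: int   # unit is an assumption (e.g. MB)


def pick_node(nodes: List[WorkerNode], cpu_request: int,
              mem_request: int) -> Optional[WorkerNode]:
    """Return the first node that fits the request and reserve its resources."""
    for node in nodes:
        if (node.available_cpus >= cpu_request
                and node.available_memory >= mem_request):
            node.available_cpus -= cpu_request
            node.available_memory -= mem_request
            return node
    return None   # no node currently has enough free resources


nodes = [WorkerNode("n1", 2, 4096), WorkerNode("n2", 8, 16384)]
chosen = pick_node(nodes, cpu_request=4, mem_request=8192)
```

Here `n1` is skipped because it has too few CPUs, so the task lands on `n2`, whose free resources are reduced accordingly.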




Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...


  1. API server: accepts HTTP/REST requests from users, generates a task_id for each request, and sends the task to the task scheduler. A load balancer can be added in front to scale out horizontally as traffic increases. The API server can split tasks by task_id and load-balance them across different job schedulers.
  2. Task scheduler: handles the task requests routed from the API server, gets available nodes, and executes the tasks.
  • Scalability: to better parallelize task scheduling, the task scheduler can be a message queue (with multiple nodes or partitions) that accepts tasks, including the task ID and the other data needed to run them on task executors.
    • Parallel: we can have multiple consumer groups, one per job template, so all jobs of the same type are sent to the corresponding group.
    • Sequential: for jobs that require ordered execution, we can send them to the same partition so the order is preserved. Adding a common partition key to these tasks ensures they end up in the same partition.
  • Reliability: tasks are copied to multiple partition replicas.
  • Fault tolerance:
    • If a task fails due to a script error, the task scheduler returns the error to the user and lets the user choose whether to retry.
    • If a task fails due to a node failure or network issue, the task scheduler can fetch the task from the message queue again and reschedule it on another healthy node.
    • If a task succeeds, it is removed from the message queue.
  3. Node monitor
  • Consists of a metric server and collectors running on each node; manages the nodes' available resources.
  • Monitors node status:
    • the available CPU, memory, and disk
    • the network connection status
    • whether the task runtime is up
  • Monitors task execution status:
    • When the task scheduler takes resources from the node monitor to run a task, it updates the node's available resources in the DB.
    • When a task hits its resource limit, it tries to throttle the task so that it stays under the limit.
    • When a task is completed/aborted/terminated, it sets the node status in the DB and frees the node's available resources in the DB.
  4. Task execution node pool
  • These nodes run a task execution runtime, such as a container environment, and can run tasks as containers.
  • The task scheduler starts a container on the available node returned by the node monitor and runs the task script inside it.
  • The node pulls the container image from the image repo and runs the task script.
  • Node status and task status are exposed to the node monitor.
  • If a task consists of multiple subtasks (a transactional task), they can run on different nodes, but the script needs to manage passing data between these nodes, for example by using scp to copy data. The task nodes can also share a data volume in the pool or use p2p communication such as sockets.
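
The three fault-tolerance rules of the task scheduler can be sketched as a single dispatch function. This is a simplified model, not the actual design: the queue is a plain `deque`, the `Outcome` enum and return strings are hypothetical, and removing a task from the queue after a script error (so that a retry is a fresh submission) is an assumption the text does not state.

```python
from collections import deque
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    SCRIPT_ERROR = "script_error"
    NODE_FAILURE = "node_failure"   # also covers network issues


def handle_outcome(queue, task_id, outcome, user_errors):
    """Apply the fault-tolerance rules and return the action taken."""
    if outcome is Outcome.NODE_FAILURE:
        # The task stays in the queue so it can be fetched again
        # and rescheduled on another healthy node.
        return "reschedule"
    queue.remove(task_id)
    if outcome is Outcome.SCRIPT_ERROR:
        # Assumed: surface the error and let the user decide to retry.
        user_errors.append(task_id)
        return "report_to_user"
    return "done"


queue = deque(["t1", "t2", "t3"])
errors = []
actions = [
    handle_outcome(queue, "t1", Outcome.SUCCESS, errors),
    handle_outcome(queue, "t2", Outcome.SCRIPT_ERROR, errors),
    handle_outcome(queue, "t3", Outcome.NODE_FAILURE, errors),
]
```

After these three calls only `t3` remains queued, waiting to be rescheduled.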



Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...

  1. Availability vs consistency: we use replicas for the message queue, which provides HA but weak consistency.
  2. Sequential tasks vs availability: for tasks that require ordering, we store them in a single partition of the message queue, which reduces availability.
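
To make the second trade-off concrete, the sketch below shows how a shared partition key pins ordered tasks to one partition while independent tasks spread out. The partition count, the key names, and the use of SHA-256 are assumptions for illustration; a real message broker applies its own partitioner.

```python
import hashlib

NUM_PARTITIONS = 8   # illustrative partition count


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a partition key to a partition index with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Ordered tasks share one key, so they always land in the same partition.
ordered = [partition_for("pipeline-A") for _ in range(3)]

# Independent tasks are keyed by their own task_id and may spread out.
independent = [partition_for(f"task-{i}") for i in range(100)]
```

Because every task with the key `pipeline-A` maps to one partition, that partition becomes a single point of contention for the ordered work, which is the availability cost noted above.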




Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.

  • task execution failure
  • network failure between components and nodes.
    • data checkpoint and retry logic
    • monitor failures and reschedule
  • node failure
    • backup/restore node data (RPO - recovery point objective)
    • replace/spin up nodes (RTO - recovery time objective)
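
The "monitor failures and reschedule" item could be approached with heartbeats, sketched below. Heartbeats, the 30-second timeout, and both function names are assumptions of this example; the design only says that failures are monitored.

```python
from typing import Dict, List

HEARTBEAT_TIMEOUT_S = 30.0   # assumed threshold for declaring a node dead


def find_failed_nodes(last_heartbeat: Dict[str, float], now: float,
                      timeout_s: float = HEARTBEAT_TIMEOUT_S) -> List[str]:
    """Return the nodes whose last heartbeat is older than the timeout."""
    return sorted(node for node, ts in last_heartbeat.items()
                  if now - ts > timeout_s)


def tasks_to_reschedule(running: Dict[str, str],
                        failed_nodes: List[str]) -> List[str]:
    """Given task_id -> node_id, return the tasks stranded on failed nodes."""
    failed = set(failed_nodes)
    return sorted(task for task, node in running.items() if node in failed)


heartbeats = {"n1": 100.0, "n2": 65.0}   # seconds on an arbitrary clock
failed = find_failed_nodes(heartbeats, now=100.0)
orphaned = tasks_to_reschedule({"t1": "n1", "t2": "n2", "t3": "n2"}, failed)
```

In this run `n2` has been silent for 35 seconds, so its two tasks are returned for rescheduling.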




Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?


  • Distributing transactional tasks among nodes and aggregating the results can be challenging to design and implement.
  • Managing the reliability of the task execution node cluster can be hard; container platforms such as k8s exist for this.
  • Task priority management
  • Shorten node warm-up and task spin-up waiting periods
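
For task priority management, one option is a heap-backed queue, sketched here. The class name and the convention that a lower number means higher priority are assumptions; the sequence counter keeps submission order among tasks of equal priority.

```python
import heapq
import itertools


class PriorityTaskQueue:
    """Min-heap of tasks; a lower priority number is served first (assumed)."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # tie-breaker: submission order

    def push(self, task_id, priority):
        heapq.heappush(self._heap, (priority, next(self._seq), task_id))

    def pop(self):
        """Remove and return the task_id with the highest priority."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


pq = PriorityTaskQueue()
pq.push("nightly-build", priority=5)
pq.push("hotfix-deploy", priority=0)
pq.push("log-cleanup", priority=5)
order = [pq.pop() for _ in range(len(pq))]
```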