Design A Task Scheduler - System Design

System requirements

Functional:

creating tasks:
- users can create tasks with the required properties
  - including an expiry duration
Two task scheduling types:
1. once at a specific time
2. repeating (on an interval)
task management:
- users can update tasks
- users can delete tasks
task execution:
- system executes tasks at the specified time
view success/failure of tasks:
- users can view if past executions of tasks were successful
- users can view the text output and logs from successful and failed tasks (these expire)
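
The two scheduling types above (once, or repeating on an interval) can be handled by one "next run" calculation. A minimal sketch, assuming a task stores its first run time plus an optional interval; the function name `next_run` and its parameters are hypothetical, not part of the design above:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def next_run(first_run: datetime,
             interval: Optional[timedelta],
             now: datetime) -> Optional[datetime]:
    """Return the next execution time at or after `now`.

    Assumption: interval=None means a one-off task. A one-off task whose
    time has already passed has no next run, so None is returned.
    """
    if interval is None:
        return first_run if first_run >= now else None
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    if first_run >= now:
        return first_run
    # Whole intervals elapsed since first_run, rounded up (ceiling division).
    elapsed = now - first_run
    periods = -(-elapsed // interval)
    return first_run + periods * interval


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
    now = datetime(2024, 1, 1, 2, 30, tzinfo=timezone.utc)
    print(next_run(start, timedelta(hours=1), now))  # next hourly slot after 02:30
    print(next_run(start, None, now))                # one-off task already in the past
```

Keeping one-off and repeating tasks in the same code path means the executor only ever asks "when is this task due next?".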

Non-Functional:

high reliability
manage thousands of tasks
security:
- user can view and edit all tasks within their security group
tasks can be replicated globally

Capacity estimation

Need to be able to store and execute thousands of tasks
100 creations / day -> about 1 every 15 minutes (very light)
- latency: 10 seconds
execution of tasks:
- average: 1000 per day
- peaks at local times:
  - late-night peak: 300 at midnight PST
  - early-morning peak: 100 at 5 AM
  - super peak: 750 at 00:00 Monday, to gather weekly data
- latency:
  - 30 seconds to start tasks - allows for jitter
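
The numbers above can be turned into rates with a few lines of arithmetic. This sketch assumes, pessimistically, that every task in a peak is scheduled for the same instant and must start within the 30-second budget; the variable names are illustrative only:

```python
MINUTES_PER_DAY = 24 * 60

creations_per_day = 100
executions_per_day = 1000
start_latency_s = 30
# Peak sizes taken from the estimates above.
peaks = {"midnight PST": 300, "5 AM": 100, "Monday 00:00": 750}

# Average gap between task creations, in minutes.
minutes_between_creations = MINUTES_PER_DAY / creations_per_day

# Average execution rate over a whole day.
avg_executions_per_minute = executions_per_day / MINUTES_PER_DAY

# Task starts per second needed to clear each peak within the latency budget.
required_starts_per_second = {
    name: count / start_latency_s for name, count in peaks.items()
}

if __name__ == "__main__":
    print(f"one creation every {minutes_between_creations:.1f} minutes")
    print(f"{avg_executions_per_minute:.2f} executions per minute on average")
    for name, rate in required_starts_per_second.items():
        print(f"{name}: {rate:.1f} starts per second")
```

The average load is tiny, so the execution fleet is sized by the Monday peak (25 starts per second under this assumption), not by the daily average.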

API design

CRUD:

Create: create a new task
- Endpoint:
  - POST /tasks/
- Request Parameters:
  - name: string, required
  - executionTime: datetime
  - script: string, required (e.g. "/usr/bin/cleanupold --old ")
  - security_group: string, required
  - userid: string, required
- Response Parameters:
  - identifier of the new task
- Response Codes:
  - 200 - ok
  - 401 - unauthorized user
  - 429 - throttled
Read: gets all tasks for a security group
- Endpoint:
  - GET /tasks/
- Parameters:
  - security_group: string, required
- Response Parameters:
  - list of task objects with parameters:
    - name: string, required
    - executionTime: datetime
    - script: string, required (e.g. "/usr/bin/cleanupold --old ")
    - security_group: string, required
    - userid: string, required
Update: update tasks
- Endpoint: POST /tasks/
  - ...
Delete
- ...
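
As an illustration of the Create request parameters listed above, here is a rough sketch of server-side validation. It assumes the body arrives as a JSON object already parsed into a dict, that the security group field is spelled `security_group`, and that `executionTime` is an ISO 8601 string; the helper name `validate_create_request` is hypothetical:

```python
from datetime import datetime
from typing import Any, Dict, List

# Fields marked "required" in the Create request above.
REQUIRED_FIELDS = ("name", "script", "security_group", "userid")


def validate_create_request(payload: Dict[str, Any]) -> List[str]:
    """Return a list of validation errors; an empty list means the body is valid."""
    errors: List[str] = []
    for field_name in REQUIRED_FIELDS:
        value = payload.get(field_name)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"{field_name}: required non-empty string")
    raw_time = payload.get("executionTime")
    if raw_time is not None:
        try:
            # Assumption: times are sent as ISO 8601, e.g. 2024-01-01T00:00:00+00:00.
            datetime.fromisoformat(raw_time)
        except (TypeError, ValueError):
            errors.append("executionTime: must be an ISO 8601 datetime")
    return errors


if __name__ == "__main__":
    body = {
        "name": "cleanup",
        "script": "/usr/bin/cleanupold --old ",
        "security_group": "ops",
        "userid": "u123",
        "executionTime": "2024-01-01T00:00:00+00:00",
    }
    print(validate_create_request(body))          # no errors expected
    print(validate_create_request({"name": ""}))  # several errors expected
```

Collecting all errors, instead of stopping at the first one, lets the client fix a request in a single round trip.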

Database design

task: NoSQL database
- name: string, required
- executionTime: datetime
- script: string, required (e.g. "/usr/bin/cleanupold --old ")
- security_group: string, required
- userid: string, required
Recently Run tasks: NoSQL database
- task id
- task name
- execution time
- S3 links to uploaded data
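
The two record types above could be modeled roughly as follows. This is a sketch under assumptions: the `task_id` field on the task record, the `expires_at` field, and the helper `make_run_record` are not in the notes above. `expires_at` is stored as epoch seconds, which is the form DynamoDB's TTL feature expects if that database is chosen:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class Task:
    task_id: str                      # assumption: a generated identifier
    name: str
    script: str
    security_group: str
    userid: str
    execution_time: Optional[datetime] = None


@dataclass
class RecentRun:
    task_id: str
    task_name: str
    execution_time: datetime
    s3_links: List[str] = field(default_factory=list)
    expires_at: int = 0               # assumption: epoch seconds, used for expiry


def make_run_record(task: Task, executed_at: datetime,
                    s3_links: List[str], retention: timedelta) -> RecentRun:
    """Build a run-history record that expires `retention` after execution."""
    return RecentRun(
        task_id=task.task_id,
        task_name=task.name,
        execution_time=executed_at,
        s3_links=list(s3_links),
        expires_at=int((executed_at + retention).timestamp()),
    )


if __name__ == "__main__":
    task = Task("t-1", "cleanup", "/usr/bin/cleanupold --old ", "ops", "u123")
    ran_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
    record = make_run_record(task, ran_at, ["s3://example-bucket/t-1/log.txt"],
                             timedelta(days=365))
    print(record)
```

The bucket name in the usage example is made up; only the shape of the record matters here.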

High-level design

You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design. If you are unfamiliar with the tool, you can simply describe your design to the chat bot and ask it to generate a starter diagram for you to modify...

Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...

Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...

Trade offs/Tech choices

log store: S3
- fast
- reliable
- cheap
- security enforcement
- automatic expiration (example: after 1 year)
task database: NoSQL database such as DynamoDB
- fast look-up
- expiration
- not concerned about immediate replication
load balancer:
- API gateway
  - versioning
execution queue:
- SQS
  - how to handle failures:
    - add these to a DLQ
    - another server can upload an error message about the failure to S3
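
The failure handling above (retry, then move to a DLQ) can be illustrated with an in-memory simulation. This does not call the SQS API; the `drain` function and its `max_receives` parameter are hypothetical and only loosely mirror the way a redrive policy caps receive attempts:

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


def drain(queue: Deque[str],
          run_task: Callable[[str], None],
          max_receives: int = 3) -> List[Tuple[str, str]]:
    """Run every queued task id; return (task_id, error) pairs sent to the DLQ."""
    dlq: List[Tuple[str, str]] = []
    receives: Dict[str, int] = {}
    while queue:
        task_id = queue.popleft()
        receives[task_id] = receives.get(task_id, 0) + 1
        try:
            run_task(task_id)
        except Exception as exc:
            if receives[task_id] >= max_receives:
                # Out of attempts: park the task with its error for later upload.
                dlq.append((task_id, str(exc)))
            else:
                # Put the task back on the queue for another attempt.
                queue.append(task_id)
    return dlq


if __name__ == "__main__":
    def run_task(task_id: str) -> None:
        if task_id == "bad":
            raise RuntimeError("boom")

    print(drain(deque(["ok", "bad"]), run_task))  # [('bad', 'boom')]
```

A separate worker could then read the DLQ entries and write the error text to the log store.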

Failure scenarios/bottlenecks

long running tasks overlapping with next run:
- add an expiry
- add an in-progress status:
  - can't start task if previous run is still going
database outages:
- prevent by using multiple replicas (DynamoDB will take care of this)
queue being overloaded:
- scale up our number of execution servers
very large histories:
- expire histories older than 5 years
- paginate requests for histories
- do not download anything from S3 until requested
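
For the overlapping-runs scenario above, the in-progress status and the expiry can be combined into a lease: a run may start only if no unexpired lease exists for that task. A minimal in-memory sketch; the class `RunGuard` is hypothetical, and a real deployment would presumably keep this state in the task database with a conditional write, not in process memory:

```python
from datetime import datetime, timedelta
from typing import Dict


class RunGuard:
    """Tracks in-progress runs; each run holds a lease that eventually expires."""

    def __init__(self, lease: timedelta) -> None:
        self._lease = lease
        self._running: Dict[str, datetime] = {}  # task_id -> lease expiry time

    def try_start(self, task_id: str, now: datetime) -> bool:
        """Return True and take the lease if no live run exists for task_id."""
        expiry = self._running.get(task_id)
        if expiry is not None and now < expiry:
            return False  # previous run is still in progress
        self._running[task_id] = now + self._lease
        return True

    def finish(self, task_id: str) -> None:
        """Release the lease when a run completes, successfully or not."""
        self._running.pop(task_id, None)


if __name__ == "__main__":
    guard = RunGuard(lease=timedelta(minutes=10))
    t0 = datetime(2024, 1, 1, 0, 0)
    print(guard.try_start("t-1", t0))                          # True
    print(guard.try_start("t-1", t0 + timedelta(minutes=5)))   # False
    print(guard.try_start("t-1", t0 + timedelta(minutes=11)))  # True
```

The expiry matters because a crashed executor never calls `finish`; without it the task would stay blocked forever.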

Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?