Design A Task Scheduler - System Design

System requirements

Functional:

Schedule simple (one-off) tasks.
Schedule recurring tasks.
Schedule dependency-based tasks, which must wait for other tasks to complete first, so the jobs form a DAG. Once the jobs are entered, a separate API endpoint links their dependencies (for example, x depends on y); we then build the graph and schedule all the jobs based on the time given there. A dependency that would make the graph cyclic is rejected with 400.
Track the status of jobs.
List all historical and scheduled jobs.
Retries are configurable in the input JSON for the APIs; script files are provided as a list of S3 links in the JSON, along with the time to schedule.
Tasks are scheduled first come, first served when all jobs have the same priority; otherwise they are ordered by a user-provided priority in the range 0 to 100. The priority can apply across the whole system or only at the user level, depending on the configuration JSON provided by the user.
If a job fails, we retry it up to the max retries provided by the user; if the worker goes down, that does not count as a job failure and the retry count is not incremented.
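
The ordering rule above (priority first, then first come, first served) can be sketched with a binary heap. This is only a minimal sketch under stated assumptions: the class name PriorityTaskQueue is hypothetical, and it assumes a larger number means a more urgent task, which the requirements do not specify.

```python
import heapq
import itertools


class PriorityTaskQueue:
    """Orders tasks by priority, then by arrival order (first come, first served)."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # increasing counter used as the tie-breaker

    def push(self, task_id, priority=1):
        if not 0 <= priority <= 100:
            raise ValueError("priority must be in the range 0..100")
        # Assumption: larger number = more urgent, so negate it for Python's min-heap.
        heapq.heappush(self._heap, (-priority, next(self._arrival), task_id))

    def pop(self):
        _, _, task_id = heapq.heappop(self._heap)
        return task_id

    def __len__(self):
        return len(self._heap)


queue = PriorityTaskQueue()
queue.push("a", priority=20)
queue.push("b", priority=50)
queue.push("c", priority=20)
order = [queue.pop() for _ in range(len(queue))]
```

Tasks "a" and "c" share a priority, so they come out in arrival order after the more urgent "b". A user-level ("local") priority could be modelled as one such queue per user, which is again an assumption rather than part of the stated design.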

Non-Functional:

Throughput: up to 864 million jobs per day, with a peak load of 10k jobs per second.
Max 1 KB per task submitted through the API.
Up to 5 minutes of run time per task in the worst case.
<= 1 second latency
Availability of three nines (99.9%); eventually consistent system.
Fault-tolerant and reliable system.
Secure system with authentication, authorization, and full data security.
Monitoring of server load, database load, and API endpoint hits; error logging; and alerting, with automated circuit breaking driven by the alerts.
Horizontally scalable system with autoscaling configured on Kubernetes.

Capacity estimation

10^4 jobs per second

Storage requirement = 10^4 * 1 KB = 10 MB per second = 0.864 TB per day = 315.36 TB per year = 1.577 PB for 5 years.

So the system is high-throughput, with a sizeable but manageable storage requirement.

To absorb unexpected load, we should provision for 2 times the QPS and double the storage requirement.
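
The arithmetic above can be checked with a few lines of Python. This is a quick sketch that uses decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes), as the estimate does; the function name storage_estimate and the headroom parameter are illustrative only.

```python
JOBS_PER_SECOND = 10**4
TASK_SIZE_BYTES = 1_000  # 1 KB in decimal units, matching the estimate above
SECONDS_PER_DAY = 86_400


def storage_estimate(jobs_per_second, task_size_bytes, headroom=1):
    """Return daily job count and storage figures; headroom=2 models the 2x buffer."""
    bytes_per_second = jobs_per_second * task_size_bytes * headroom
    tb_per_day = bytes_per_second * SECONDS_PER_DAY / 1e12
    tb_per_year = tb_per_day * 365
    return {
        "jobs_per_day": jobs_per_second * headroom * SECONDS_PER_DAY,
        "mb_per_second": bytes_per_second / 1e6,
        "tb_per_day": tb_per_day,
        "tb_per_year": tb_per_year,
        "pb_for_5_years": tb_per_year * 5 / 1000,
    }


baseline = storage_estimate(JOBS_PER_SECOND, TASK_SIZE_BYTES)
doubled = storage_estimate(JOBS_PER_SECOND, TASK_SIZE_BYTES, headroom=2)
```

With the 2x buffer, every figure simply doubles, so five years of storage comes to roughly 3.15 PB.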

API design

simple tasks api:

POST /v1/tasks/simple/

{
  "command": "python x.py --p1",
  "s3_script_path": {"x.py": "www.s3.com/bucket-name/user_name/x.py"},
  "max_retries": 5,
  "schedule-time": "YYYY-MM-DD Hour: Min: TimeZone",
  "schedule": "ASAP",
  "priority": 20,
  "priority-level": "global",
  "idempotency_key": "sfafa324324"
}

returns: 202 Accepted

The request is processed asynchronously.

Provide either schedule-time or schedule.

priority-level can be local or global.

400 if the command, the script, or both schedule-time and schedule are missing.

If priority is not provided, it defaults to 1,

and priority-level defaults to local.

401 if unauthenticated

403 if unauthorized

500 for internal server error
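
The validation and default rules for this endpoint could look roughly like the sketch below. It is an illustration rather than the actual handler: the function name validate_simple_task and the error messages are hypothetical, and rejecting out-of-range priorities with 400 is an assumption based on the 0 to 100 range in the requirements.

```python
REQUIRED_FIELDS = ("command", "s3_script_path", "idempotency_key")


def validate_simple_task(payload):
    """Return (status_code, body): 202 with defaults applied, or 400 with an error."""
    missing = [field for field in REQUIRED_FIELDS if not payload.get(field)]
    if missing:
        return 400, {"error": "missing fields: " + ", ".join(missing)}

    # At least one of schedule-time / schedule must be present.
    if "schedule-time" not in payload and "schedule" not in payload:
        return 400, {"error": "schedule-time or schedule is required"}

    priority = payload.get("priority", 1)  # default priority is 1
    if not isinstance(priority, int) or not 0 <= priority <= 100:
        # Assumption: values outside 0..100 are rejected rather than clamped.
        return 400, {"error": "priority must be an integer in 0..100"}

    level = payload.get("priority-level", "local")  # default level is local
    if level not in ("local", "global"):
        return 400, {"error": "priority-level must be local or global"}

    return 202, {**payload, "priority": priority, "priority-level": level}


status, body = validate_simple_task({
    "command": "python x.py --p1",
    "s3_script_path": {"x.py": "www.s3.com/bucket-name/user_name/x.py"},
    "schedule": "ASAP",
    "idempotency_key": "sfafa324324",
})
```

Authentication and authorization (401 and 403) would be checked before this step, for example in middleware that verifies the JWT.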

recurring tasks api:

POST /v1/tasks/recurring/

input=

{
  "command": "python x.py --p1",
  "s3_script_path": {"x.py": "www.s3.com/bucket-name/user_name/x.py"},
  "max_retries": 5,
  "schedule-time": "YYYY-MM-DD Hour: Min: TimeZone",
  "interval": "YYYY-MM-DD Hour: Min",
  "schedule": "ASAP",
  "priority": 20,
  "priority-level": "global",
  "idempotency_key": "sfafa324324"
}

returns: 202 Accepted

{
  "taskid": "23421"
}

The request is processed asynchronously.

Provide either schedule-time or schedule.

priority-level can be local or global.

400 if the command, the script, or both schedule-time and schedule are missing.

If priority is not provided, it defaults to 1,

and priority-level defaults to local.

The interval is added to the base schedule time on every run, and the task is scheduled again after that much time.

401 if unauthenticated

403 if unauthorized

500 for internal server error
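
How the interval is applied to the base schedule time can be sketched as follows. This is a minimal illustration that assumes the interval has already been parsed into a datetime.timedelta; the API's own interval string format is not interpreted here, and the function name next_run_after is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def next_run_after(base_time, interval, now):
    """Return the first run time strictly after `now` on the base_time + k * interval grid."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    if now < base_time:
        return base_time
    # Number of whole intervals that have already elapsed since the base time.
    elapsed_intervals = (now - base_time) // interval
    return base_time + (elapsed_intervals + 1) * interval


base = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
every_six_hours = timedelta(hours=6)
upcoming = next_run_after(base, every_six_hours, datetime(2024, 1, 1, 20, 30, tzinfo=timezone.utc))
```

Computing each run from the base time, instead of adding the interval to the previous actual start time, keeps the schedule from drifting when a run starts late.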

DAG based Jobs:

Create new dag:

POST /v1/tasks/dag/

{
  "max_retries": 5,
  "priority": 20,
  "priority-level": "global",
  "idempotency_key": "sfafa324324"
}

returns: 200 success

priority-level can be local or global.

max_retries defaults to 1.

400 if command or script is missing.

If priority is not provided, it defaults to 1,

and priority-level defaults to local.

401 if unauthenticated

403 if unauthorized

500 for internal server error

The following API adds a job to a DAG:

POST /v1/tasks/dag/

{
  "command": "python x.py --p1",
  "s3_script_path": {"x.py": "www.s3.com/bucket-name/user_name/x.py"},
  "max_retries": 5,
  "priority": 20,
  "priority-level": "global",
  "dag_id": "123",
  "idempotency_key": "sfafa324324"
}

returns: 200 success

priority-level can be local or global.

max_retries defaults to 1.

400 if command, script, or dag_id is missing.

If priority is not provided, it defaults to 1,

and priority-level defaults to local.

401 if unauthenticated

403 if unauthorized

500 for internal server error

The following API starts a DAG:

POST /v1/tasks/dag/

{
  "command": "start",
  "dag_id": "1234",
  "max_retries": 5,
  "priority": 20,
  "priority-level": "global",
  "idempotency_key": "sfafa324324"
}

returns: 202 Accepted

max_retries defaults to 1.

400 if command, script, or dag_id is missing.

If priority is not provided, it defaults to 1,

and priority-level defaults to local.

401 if unauthenticated

403 if unauthorized

500 for internal server error

The following API creates a dependency between two jobs of a DAG:

POST /v1/tasks/dag/dependency/

{
  "dependent_job_id": "132",
  "dependee_job_id": "123",
  "dag_id": "1345325",
  "idempotency_key": "sfafa324324"
}

returns: 200 success

{
  "id": "<DAG id>"
}

The response returns the DAG id.

max_retries defaults to 1.

400 if dependent_job_id, dependee_job_id, or dag_id is missing, or if the dependency would make the graph cyclic.

If priority is not provided, it defaults to 1,

and priority-level defaults to local.

401 if unauthenticated

403 if unauthorized

500 for internal server error
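
Rejecting a cyclic dependency with 400 can be done by checking reachability before an edge is stored. The sketch below is one possible in-memory approach, not the actual service code; the class name DependencyGraph is hypothetical, and a real system would presumably load the edges for a given dag_id from the database.

```python
from collections import defaultdict


class DependencyGraph:
    """Stores 'dependent waits for dependee' edges and refuses edges that close a cycle."""

    def __init__(self):
        self._depends_on = defaultdict(set)  # job id -> ids of jobs it waits for

    def _reaches(self, start, target):
        # Iterative depth-first search along existing dependency edges.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self._depends_on.get(node, ()))
        return False

    def add_dependency(self, dependent, dependee):
        """Return 200 if the edge was added, 400 if it would create a cycle."""
        # The new edge closes a cycle if the dependee already waits
        # (directly or transitively) for the dependent.
        if self._reaches(dependee, dependent):
            return 400
        self._depends_on[dependent].add(dependee)
        return 200


graph = DependencyGraph()
first = graph.add_dependency("132", "123")   # 132 waits for 123
second = graph.add_dependency("123", "132")  # would make 123 wait for 132 as well
```

Each check costs one graph traversal, which should be acceptable as long as individual DAGs stay small.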

GET /v1/tasks/{id}

returns: 200

{
  "status": "running",
  "retries": "2"
}

status can be scheduled, running, completed, failed, or retried.

retries is the number of retries already attempted.
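
The status and retries fields follow the retry rule from the requirements: a job failure consumes a retry, while a lost worker does not. A minimal sketch of that bookkeeping is shown below; the TaskRecord class and the outcome names ("success", "job_failed", "worker_lost") are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    max_retries: int
    retries: int = 0
    status: str = "scheduled"


def on_attempt_finished(task, outcome):
    """Update a task after an attempt ends with one of the hypothetical outcomes."""
    if outcome == "success":
        task.status = "completed"
    elif outcome == "worker_lost":
        # The worker went down: re-queue the task without touching the retry count.
        task.status = "scheduled"
    elif outcome == "job_failed":
        if task.retries < task.max_retries:
            task.retries += 1
            task.status = "retried"
        else:
            task.status = "failed"
    else:
        raise ValueError(f"unknown outcome: {outcome}")
    return task


task = TaskRecord(max_retries=1)
on_attempt_finished(task, "worker_lost")  # retries stays 0
on_attempt_finished(task, "job_failed")   # retries becomes 1, status "retried"
on_attempt_finished(task, "job_failed")   # budget used up, status "failed"
```

Telling a job failure apart from a worker failure needs some signal outside this function, for example a missed worker heartbeat; that mechanism is an assumption and is not covered by the text above.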

Paginated API for listing jobs:

GET /v1/tasks/?page_num=2&pagesize=10

returns: 200

[
  { "job_id": "234", "status": "running", "retries": "2" },
  { "job_id": "2324", "status": "completed", "retries": "2" },
  ...
]
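
The page_num and pagesize parameters map to a simple offset calculation. The sketch below assumes pages are numbered from 1, which the API description does not state; the function name paginate is illustrative.

```python
def paginate(jobs, page_num=1, pagesize=10):
    """Return the slice of `jobs` for a 1-based page number."""
    if page_num < 1 or pagesize < 1:
        raise ValueError("page_num and pagesize must be positive")
    start = (page_num - 1) * pagesize
    return jobs[start:start + pagesize]


all_jobs = [{"job_id": str(i), "status": "completed", "retries": "0"} for i in range(25)]
second_page = paginate(all_jobs, page_num=2, pagesize=10)
```

At the volumes estimated earlier, offset-based paging over the full job history would get slow for deep pages; a cursor keyed on job id or creation time is a common alternative, though that goes beyond what the API above specifies.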

All the APIs above carry a JWT in their headers for user authentication and authorization.

For all the above APIs the idempotency key is mandatory; a request without one is rejected with 400 Bad Request.
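
One way to honor the idempotency key is to remember the first response for each key and replay it on repeats. The sketch below keeps that mapping in memory purely for illustration; the class name IdempotencyStore and the create_task callback are hypothetical, and a real deployment would presumably use a shared store and scope keys per user.

```python
class IdempotencyStore:
    """Replays the stored response when the same idempotency key is seen again."""

    def __init__(self):
        self._responses = {}  # idempotency key -> (status, body)

    def handle(self, payload, create_task):
        key = payload.get("idempotency_key")
        if not key:
            return 400, {"error": "idempotency_key is required"}
        if key in self._responses:
            return self._responses[key]  # duplicate request: do not create a second task
        response = (202, create_task(payload))
        self._responses[key] = response
        return response


created = []


def create_task(payload):
    # Stand-in for the real task creation; it only records the command.
    created.append(payload["command"])
    return {"taskid": str(len(created))}


store = IdempotencyStore()
request = {"idempotency_key": "sfafa324324", "command": "python x.py --p1"}
first_response = store.handle(request, create_task)
second_response = store.handle(request, create_task)
```

Both calls return the same response, and only one task is created.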

We can also set a rate limit on how many times per minute a user can add tasks.

Database design

High-level design

Request flows

Detailed component design

Trade offs/Tech choices

Failure scenarios/bottlenecks

Future improvements
