Design A Task Scheduler - System Design

Requirements

Functional Requirements:

Execute uploaded tasks at their scheduled time
Users can upload jobs (one-time or CRON)

Non-Functional Requirements:

Low latency when executing jobs
High availability, so that jobs run at their specified time

Estimations:

864M jobs scheduled each day
- 60% are one-time jobs (~518M)
- 40% are CRON jobs (~346M)
10,000 jobs uploaded per second (864M / 86,400 seconds per day)
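
The figures above can be cross-checked with a few lines of arithmetic. This is only a sanity check of the stated numbers; the variable names are illustrative and not part of the design.

```python
# Back-of-the-envelope check of the estimates in this section.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

uploads_per_second = 10_000
jobs_per_day = uploads_per_second * SECONDS_PER_DAY

# Split from the estimates: 60% one-time jobs, 40% CRON jobs.
one_time_jobs = jobs_per_day * 60 // 100
cron_jobs = jobs_per_day * 40 // 100

print(f"jobs per day:  {jobs_per_day:,}")
print(f"one-time jobs: {one_time_jobs:,}")
print(f"CRON jobs:     {cron_jobs:,}")
```

So 10,000 uploads per second sustained over a full day matches the 864M jobs per day figure, with roughly 518M one-time jobs and 346M CRON jobs.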

Read-heavy, because the system constantly checks which jobs are due to run. Needs a plan to scale (see High-Level Design).

Write throughput is also high. Needs a plan to scale (see High-Level Design).

API Design

Core Entities:

JobMetaData

JobName
StoredLocation
CRON
ScheduledTime
Status
Priority
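
A minimal sketch of the JobMetaData entity as a Python dataclass. The field names come from the list above; the types, the default values, and the Status values are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Status(Enum):
    # Assumed status values; the design only names a "Status" field.
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


@dataclass
class JobMetaData:
    job_name: str
    stored_location: str          # e.g. the S3 URL of the uploaded script
    scheduled_time: datetime      # next time the job should run
    cron: Optional[str] = None    # CRON expression; None for one-time jobs
    status: Status = Status.PENDING
    priority: int = 0             # assumed: higher number = higher priority

    def is_recurring(self) -> bool:
        """A job is treated as recurring when it carries a CRON expression."""
        return self.cron is not None


example_job = JobMetaData(
    job_name="nightly-report",
    stored_location="s3://example-bucket/scripts/nightly-report",  # hypothetical
    scheduled_time=datetime(2030, 1, 1, 2, 0, tzinfo=timezone.utc),
    cron="0 2 * * *",
)
print(example_job.is_recurring())
```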

POST:

/Task/Upload

{
TaskName,
TaskTime,
Task
}
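
As a rough illustration, the upload request could be assembled as below. The endpoint and the three field names come from the request shape above; sending TaskTime as an ISO-8601 string and Task as the script text are assumptions, and build_upload_request is a hypothetical helper.

```python
import json
from datetime import datetime, timezone


def build_upload_request(task_name: str, task_time: datetime, task: str) -> dict:
    """Build a POST /Task/Upload request as a plain dict (no network call)."""
    body = {
        "TaskName": task_name,
        "TaskTime": task_time.isoformat(),  # assumed wire format: ISO-8601
        "Task": task,                       # assumed: the script contents
    }
    return {"method": "POST", "path": "/Task/Upload", "body": json.dumps(body)}


request = build_upload_request(
    "nightly-report",
    datetime(2030, 1, 1, 2, 0, tzinfo=timezone.utc),
    "echo 'hello'",
)
print(request["method"], request["path"])
print(request["body"])
```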

High-Level Design

Main table is the JobMetaData table, stored in Postgres (relational rows) and partitioned by time interval so that no single DB is overwhelmed
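
One way to picture time-interval partitioning is a function that maps a job's scheduled time to the partition that holds it. The one-hour interval and the job_metadata_ naming scheme below are assumptions; the design only says that the table is partitioned by time interval.

```python
from datetime import datetime, timedelta, timezone

PARTITION_INTERVAL = timedelta(hours=1)  # assumed interval size
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def partition_start(scheduled_time: datetime,
                    interval: timedelta = PARTITION_INTERVAL) -> datetime:
    """Round a timezone-aware timestamp down to the start of its interval."""
    buckets = (scheduled_time - EPOCH) // interval  # whole intervals since epoch
    return EPOCH + buckets * interval


def partition_name(scheduled_time: datetime) -> str:
    """Hypothetical partition naming scheme: job_metadata_YYYYMMDD_HHMM."""
    return "job_metadata_" + partition_start(scheduled_time).strftime("%Y%m%d_%H%M")


print(partition_name(datetime(2024, 1, 1, 12, 34, 56, tzinfo=timezone.utc)))
```

With this mapping, the component that looks for due jobs only needs to query the partition covering the current time window rather than the whole table.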

Async leader-follower replication, set up with quorum

Reasoning: in most cases it would be okay if job execution starts a couple of seconds late, but when the script is executed its URL location must be consistent and correct. There are retries too.

Retries:

Job runner checks for jobs that are due, with, let's say, a 10-second buffer -> places each due job on the priority queue

-> If a job fails, updates the DB with the retry count and increases the priority of that job

3 retries max
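
The retry rules above can be sketched as follows. The 10-second buffer and the limit of 3 retries come from the notes; the Job class, the status strings, and the three priority levels (0 = low, 2 = high) are assumptions, and the DB update is represented by mutating an in-memory object.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_RETRIES = 3
BUFFER = timedelta(seconds=10)
MAX_PRIORITY = 2  # assumed levels: 0 = low, 1 = med, 2 = high


@dataclass
class Job:
    name: str
    scheduled_time: datetime
    priority: int = 0
    retry_count: int = 0
    status: str = "PENDING"


def is_due(job: Job, now: datetime) -> bool:
    """A pending job is due if it is scheduled within the buffer window."""
    return job.status == "PENDING" and job.scheduled_time <= now + BUFFER


def record_failure(job: Job) -> bool:
    """Apply the retry rules; return True if the job should be re-enqueued."""
    if job.retry_count >= MAX_RETRIES:
        job.status = "FAILED"  # retries exhausted, give up
        return False
    job.retry_count += 1
    job.priority = min(job.priority + 1, MAX_PRIORITY)  # bump priority, capped
    return True


now = datetime(2030, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
job = Job("demo", scheduled_time=now + timedelta(seconds=5))
print(is_due(job, now))
print([record_failure(job) for _ in range(4)], job.status)
```

In this sketch a job gets its first attempt plus up to three retries; the fourth failure marks it as FAILED.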

User uploads task -> Tries to store script in S3 -> If successful, generates S3 location URL -> Creates entry in JobMetaData
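
A minimal sketch of the upload flow, using in-memory stand-ins instead of real S3 and Postgres clients so that it runs on its own. InMemoryObjectStore, FailingObjectStore, the bucket name, and the key layout are all hypothetical; the point is only the ordering: the metadata row is created after the script has been stored successfully, and not at all if storing fails.

```python
from typing import Optional


class InMemoryObjectStore:
    """Stand-in for S3: stores bytes and returns a location URL."""

    def __init__(self, bucket: str):
        self.bucket = bucket
        self.objects = {}

    def put(self, key: str, data: bytes) -> str:
        self.objects[key] = data
        return f"s3://{self.bucket}/{key}"


class FailingObjectStore:
    """Stand-in that simulates a failed upload."""

    def put(self, key: str, data: bytes) -> str:
        raise OSError("upload failed")


def upload_task(task_name: str, task_time: str, script: bytes,
                store, job_table: list) -> Optional[dict]:
    """Store the script first; only on success create the JobMetaData row."""
    try:
        location = store.put(f"scripts/{task_name}", script)
    except OSError:
        return None  # no metadata row is written if the script was not stored
    row = {
        "JobName": task_name,
        "StoredLocation": location,
        "ScheduledTime": task_time,
        "Status": "PENDING",  # assumed initial status
    }
    job_table.append(row)  # stand-in for an INSERT into JobMetaData
    return row


table = []
store = InMemoryObjectStore("task-scripts")
print(upload_task("demo", "2030-01-01T02:00:00Z", b"echo hi", store, table))
print(upload_task("broken", "2030-01-01T02:00:00Z", b"echo hi", FailingObjectStore(), table))
```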

Detailed Component Design

Dispatcher service constantly queries the partitions and enqueues due jobs onto a Kafka topic chosen by priority (low, med, high topics). Each time a job fails, it is put on a higher-priority topic.
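
The topic selection can be sketched as a small pure function, kept separate from any Kafka client. The topic names below are hypothetical, and moving up exactly one level per failure (capped at the high topic) is one possible reading of the escalation rule.

```python
# Hypothetical topic names, ordered from lowest to highest priority.
TOPICS = ["jobs-low", "jobs-med", "jobs-high"]


def topic_for(base_priority: int, failures: int) -> str:
    """Pick the topic for a job: start at its base level, move up per failure."""
    level = min(base_priority + failures, len(TOPICS) - 1)  # cap at "high"
    return TOPICS[level]


print(topic_for(0, 0))
print(topic_for(0, 1))
print(topic_for(1, 2))
```

The dispatcher would then produce the job to the returned topic with whichever Kafka client the service uses.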

Consumer nodes consume from the topics and execute the jobs
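
A rough sketch of a consumer step, with in-memory deques standing in for the three topics. Draining the higher-priority queue first is an assumption about how consumers treat the topics, and consume_next and the execute callback are hypothetical names.

```python
from collections import deque
from typing import Callable, Optional, Tuple

PRIORITY_ORDER = ("high", "med", "low")  # assumed: highest priority served first


def consume_next(queues: dict, execute: Callable[[str], bool]) -> Optional[Tuple[str, str, bool]]:
    """Take one job from the highest-priority non-empty queue and run it."""
    for level in PRIORITY_ORDER:
        queue = queues[level]
        if queue:
            job = queue.popleft()
            succeeded = execute(job)  # execute() returns True on success
            return level, job, succeeded
    return None  # nothing to do


queues = {"high": deque(["retry-job"]), "med": deque(), "low": deque(["new-job"])}
print(consume_next(queues, lambda job: True))
print(consume_next(queues, lambda job: True))
print(consume_next(queues, lambda job: True))
```

A failed result is what would trigger the retry path: update the retry count in the DB and re-enqueue the job at a higher priority.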