Requirements
Functional Requirements:
- Execute a task at its scheduled time
- Users are able to upload one-time and CRON jobs
Non-Functional Requirements:
- Low latency between a job's scheduled time and the start of its execution
- Highly available, to make sure jobs run at their specified time
Estimations:
- 864M jobs scheduled each day (10,000 jobs/s x 86,400 s/day)
- 60% are one-time jobs (~518M)
- 40% are CRON jobs (~346M)
- 10,000 jobs uploaded per second
Read heavy, because the system is constantly checking which jobs need to be run - need a plan to scale reads (listed in the High-Level Design portion)
Write throughput is also high (10,000 uploads per second) - need a plan to scale writes (listed in the High-Level Design portion)
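The estimates above follow from the upload rate. A quick check of the arithmetic, assuming the 10,000 jobs/s rate is sustained for a full day (the variable names are just for illustration):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

uploads_per_second = 10_000
jobs_per_day = uploads_per_second * SECONDS_PER_DAY

# Integer math keeps the 60/40 split exact.
one_time_jobs = jobs_per_day * 60 // 100
cron_jobs = jobs_per_day * 40 // 100

print(jobs_per_day)   # 864000000
print(one_time_jobs)  # 518400000
print(cron_jobs)      # 345600000
```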
API Design
Core Entities:
JobMetaData
- JobName
- StoredLocation
- CRON
- ScheduledTime
- Status
- Priority
POST:
/Task/Upload
{
TaskName,
TaskTime,
Task
}
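A minimal sketch of how the entity and the upload payload could be modeled. Assumptions not in the notes above: TaskTime arrives as an ISO 8601 string, Task is the script body as text, and the default Status/Priority values ("PENDING", "low") are placeholders.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class JobMetaData:
    job_name: str
    stored_location: str        # S3 URL of the uploaded script
    cron: Optional[str]         # None for one-time jobs
    scheduled_time: datetime
    status: str = "PENDING"     # assumed status value
    priority: str = "low"       # assumed default priority


@dataclass
class UploadRequest:
    task_name: str
    task_time: datetime
    task: str


def parse_upload_request(body: dict) -> UploadRequest:
    """Validate the POST /Task/Upload body and convert it to a typed request."""
    missing = [k for k in ("TaskName", "TaskTime", "Task") if k not in body]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return UploadRequest(
        task_name=body["TaskName"],
        task_time=datetime.fromisoformat(body["TaskTime"]),  # assumes ISO 8601
        task=body["Task"],
    )
```

Rejecting a malformed body before touching S3 or the DB keeps bad uploads from leaving orphaned scripts or rows.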
High-Level Design
Main table is the JobMetaData table -> stored in Postgres for the relational rows -> partitioned by time interval (on ScheduledTime) to avoid overwhelming a single DB
Async leader-follower replication set up with quorum
- Reasoning: in most cases it'd be okay if job execution starts a couple of seconds late, but when executing the script the URL location should be consistent/correct, and there are retries too
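To make the time-interval partitioning of JobMetaData concrete, here is a rough sketch that maps a ScheduledTime to a partition name. The one-hour interval and the job_metadata_YYYYMMDD_HH naming scheme are assumptions for illustration; the notes do not fix either.

```python
from datetime import datetime, timezone

PARTITION_HOURS = 1  # assumed interval size; must divide 24 evenly


def partition_name(scheduled_time: datetime) -> str:
    """Return the (hypothetical) partition that holds a job scheduled at this time."""
    t = scheduled_time.astimezone(timezone.utc)  # expects a timezone-aware datetime
    bucket_hour = t.hour - (t.hour % PARTITION_HOURS)
    return f"job_metadata_{t:%Y%m%d}_{bucket_hour:02d}"


print(partition_name(datetime(2024, 1, 1, 13, 45, tzinfo=timezone.utc)))
# job_metadata_20240101_13
```

With this layout the dispatcher only needs to read the partition covering the current interval (and possibly the next one), instead of scanning the whole table.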
Retries:
Job runner checks for jobs that are due within, let's say, a 10-second buffer -> places the job on the priority queue
-> If the job fails, updates the DB with the retry count and increases the priority of the current job
- 3 retries max
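The retry rule above could look roughly like this. It reads "3 retries max" as one initial attempt plus up to three retries; the JobState fields and the "PENDING"/"FAILED" status values are assumptions, and the priority levels are borrowed from the low/med/high topics described later.

```python
from dataclasses import dataclass

PRIORITIES = ["low", "med", "high"]
MAX_RETRIES = 3


@dataclass
class JobState:
    retry_count: int = 0
    priority: str = "low"
    status: str = "PENDING"  # assumed status values


def record_failure(job: JobState) -> bool:
    """Update the job after a failed run. Returns True if it should be re-enqueued."""
    if job.retry_count >= MAX_RETRIES:
        job.status = "FAILED"  # retries exhausted, give up
        return False
    job.retry_count += 1
    # Bump the priority one level, capped at the highest level.
    idx = PRIORITIES.index(job.priority)
    job.priority = PRIORITIES[min(idx + 1, len(PRIORITIES) - 1)]
    job.status = "PENDING"
    return True
```

In the real system the updated retry count and priority would be written back to the JobMetaData row before the job is placed on the queue again.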
User uploads task -> Try to store the script in S3 -> If successful, generate the S3 location URL -> Create the entry in JobMetaData
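A sketch of the upload flow with in-memory stand-ins for S3 and the JobMetaData table, so the ordering is easy to see. InMemoryObjectStore, the bucket name, and the row layout are all hypothetical; a real implementation would call an S3 client and Postgres instead.

```python
from datetime import datetime


class InMemoryObjectStore:
    """Stand-in for S3: stores bytes under a key and returns a location URL."""

    def __init__(self) -> None:
        self.objects = {}

    def put(self, key: str, data: bytes) -> str:
        self.objects[key] = data
        return f"s3://example-job-scripts/{key}"  # hypothetical bucket name


def upload_task(store, metadata_table: list, task_name: str,
                task_time: datetime, script: bytes) -> dict:
    """Store the script first; only create the metadata row if that succeeds."""
    location = store.put(f"scripts/{task_name}", script)  # raises on failure
    row = {
        "job_name": task_name,
        "stored_location": location,
        "scheduled_time": task_time,
        "status": "PENDING",  # assumed initial status
        "priority": "low",    # assumed default priority
    }
    metadata_table.append(row)
    return row
```

Writing to the object store before the DB means a failed upload never leaves a JobMetaData row pointing at a script that does not exist.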
Detailed Component Design
Dispatcher service constantly queries the partitions and enqueues due jobs onto a Kafka topic chosen by priority (low, med, high topics); each time a job fails, it'll be put on a higher-priority topic
Consumer nodes consume from the topics and execute the jobs
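One dispatcher pass might be sketched as follows. The publish callback stands in for a Kafka producer, and the topic names, row layout, and "PENDING"/"QUEUED" statuses are assumptions; the 10-second buffer comes from the Retries notes.

```python
from datetime import datetime, timedelta

BUFFER = timedelta(seconds=10)
TOPIC_BY_PRIORITY = {"low": "jobs-low", "med": "jobs-med", "high": "jobs-high"}  # assumed names


def dispatch_due_jobs(rows: list, now: datetime, publish) -> int:
    """Enqueue every pending job due within the buffer. Returns the number dispatched."""
    dispatched = 0
    for row in rows:
        if row["status"] == "PENDING" and row["scheduled_time"] <= now + BUFFER:
            publish(TOPIC_BY_PRIORITY[row["priority"]], row["job_name"])
            row["status"] = "QUEUED"  # prevents re-dispatch on the next pass
            dispatched += 1
    return dispatched
```

Marking the row as queued in the same pass is what keeps a job from being enqueued twice by consecutive polls; with several dispatcher instances, that status update would need to be an atomic DB operation.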