Requirements
Functional Requirements:
- Execute tasks
- Users can upload CRON jobs
Non-Functional Requirements:
- Low latency when executing jobs
- Highly available, to make sure jobs run at their specified time
Estimations:
- 864M jobs scheduled each day
- 60% are one-time jobs (~520M)
- 40% are CRON jobs (~345M)
- 10,000 jobs uploaded per second (864M / 86,400 seconds per day)
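The arithmetic behind these estimates can be checked with a few lines. This is only a back-of-the-envelope sketch: it assumes uploads are spread evenly over the day and that each scheduled job corresponds to one upload, neither of which is stated in the notes.

```python
# Back-of-the-envelope check of the estimates (assumes a uniform upload rate).
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

jobs_per_day = 864_000_000
one_time_jobs = jobs_per_day * 60 // 100               # 60% share, roughly 520M
cron_jobs = jobs_per_day * 40 // 100                   # 40% share, roughly 345M
uploads_per_second = jobs_per_day // SECONDS_PER_DAY   # average upload rate

print(one_time_jobs, cron_jobs, uploads_per_second)
```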
Read-heavy, because the system keeps checking which jobs need to run - need a plan to scale (listed in the High-Level Design portion)
Also a lot of write throughput - need a plan to scale (listed in the High-Level Design portion)
API Design
Core Entities:
JobMetaData
- Job Name
- StoredLocation
- CRON
- ScheduledTime
- Status
- Priority
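As a rough sketch, the entity above could be modelled like this. The field types, the defaults, and the `JobStatus` values are assumptions for illustration; the notes only name the fields.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class JobStatus(Enum):
    # Hypothetical status values; the notes only list a "Status" field.
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class JobMetaData:
    job_name: str
    stored_location: str           # where the uploaded script lives, e.g. an S3 URL
    scheduled_time: datetime       # next time the job should run
    cron: Optional[str] = None     # CRON expression; None for one-time jobs (assumption)
    status: JobStatus = JobStatus.PENDING
    priority: int = 0

    @property
    def is_recurring(self) -> bool:
        # A job with a CRON expression is treated as recurring.
        return self.cron is not None


job = JobMetaData(
    job_name="nightly-report",
    stored_location="s3://example-bucket/nightly-report.sh",  # made-up location
    scheduled_time=datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc),
    cron="0 9 * * *",
)
```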
POST:
/Task/Upload
{
TaskName,
TaskTime,
Task
}
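A minimal sketch of validating this request body on the server side. The function name `validate_upload_request` is hypothetical, and treating `TaskTime` as an ISO 8601 string is an assumption; the notes do not specify the field formats.

```python
from datetime import datetime

REQUIRED_FIELDS = ("TaskName", "TaskTime", "Task")


def validate_upload_request(body: dict) -> dict:
    """Check that the upload body has all three fields and parse TaskTime."""
    missing = [name for name in REQUIRED_FIELDS if name not in body]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return {
        "TaskName": str(body["TaskName"]),
        # Assumption: TaskTime arrives as an ISO 8601 string with a UTC offset.
        "TaskTime": datetime.fromisoformat(body["TaskTime"]),
        "Task": body["Task"],
    }


request = validate_upload_request({
    "TaskName": "nightly-report",
    "TaskTime": "2025-01-01T09:00:00+00:00",
    "Task": "echo hello",
})
```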
High-Level Design
Main table is JobMetaData table -> Postgres for the relational rows -> Will partition by time intervals to avoid overwhelming a single DB
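One way to picture the time-interval partitioning is a function that maps a job's scheduled time to a partition name. The hourly interval and the `job_metadata_YYYY_MM_DD_HH` naming scheme are assumptions for illustration; the notes do not fix the interval size.

```python
from datetime import datetime, timezone


def partition_for(scheduled_time: datetime) -> str:
    """Return the (hypothetical) hourly partition a job row would land in."""
    if scheduled_time.tzinfo is None:
        raise ValueError("scheduled_time must be timezone-aware")
    utc = scheduled_time.astimezone(timezone.utc)
    # All jobs scheduled within the same UTC hour share one partition.
    return f"job_metadata_{utc:%Y_%m_%d_%H}"


name = partition_for(datetime(2025, 1, 1, 9, 30, tzinfo=timezone.utc))
```

With this layout the job runner only has to scan the partition for the current interval, which keeps each poll small.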
Async leader-follower replication set up with quorum
- Reasoning: in most cases it'd be okay if job execution starts a couple of seconds late, but when executing the script, the stored URL location should be consistent/correct, and there are retries too
Retries:
Job runner checks for jobs that are due, with, let's say, a 10-second buffer -> places each job on the priority queue
-> If a job fails, updates the DB with the retry count and increases the priority of that job
- 3 retries max
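The retry flow above can be sketched with an in-memory priority queue. This is a simplified illustration, not the actual runner: `Job`, `JobQueue`, `is_due`, and `run_next` are hypothetical names, a larger number is assumed to mean higher priority, and "3 retries max" is read as three retries after the first attempt. The DB update is stood in for by mutating the job object.

```python
import heapq
import itertools
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable

MAX_RETRIES = 3
BUFFER = timedelta(seconds=10)


@dataclass
class Job:
    name: str
    priority: int = 0
    retry_count: int = 0


def is_due(scheduled_time: datetime, now: datetime) -> bool:
    # Pick up jobs slightly early so they are queued before their start time.
    return scheduled_time <= now + BUFFER


class JobQueue:
    def __init__(self) -> None:
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order per priority

    def push(self, job: Job) -> None:
        # heapq is a min-heap, so negate priority to pop the highest first.
        heapq.heappush(self._heap, (-job.priority, next(self._order), job))

    def pop(self) -> Job:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def run_next(queue: JobQueue, execute: Callable[[Job], None]) -> str:
    job = queue.pop()
    try:
        execute(job)
        return "succeeded"
    except Exception:
        if job.retry_count >= MAX_RETRIES:
            return "failed"          # retries exhausted
        job.retry_count += 1         # stands in for the DB update
        job.priority += 1            # retried jobs move ahead in the queue
        queue.push(job)
        return "retrying"
```

A job that always fails is attempted four times in total under this reading (one initial attempt plus three retries) before it is marked as failed.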
User uploads task -> Tries to store script in S3 -> If successful, generates S3 location URL -> Creates entry in JobMetaData
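The upload flow can be sketched as below. To keep the example self-contained, the S3 write and the DB insert are passed in as plain callables; `handle_upload`, `store_script`, and `create_entry` are hypothetical names, and a real implementation would call an S3 client and Postgres instead.

```python
from typing import Callable, Optional


def handle_upload(
    task_name: str,
    task_time: str,
    script: bytes,
    store_script: Callable[[str, bytes], str],
    create_entry: Callable[[dict], None],
) -> Optional[str]:
    """Store the script first; create the JobMetaData row only if that succeeds."""
    try:
        location = store_script(task_name, script)  # returns the stored location URL
    except Exception:
        return None  # storage failed, so no metadata row is written
    create_entry({
        "JobName": task_name,
        "StoredLocation": location,
        "ScheduledTime": task_time,
    })
    return location


# In-memory stand-ins for S3 and the JobMetaData table.
rows = []
location = handle_upload(
    "nightly-report",
    "2025-01-01T09:00:00+00:00",
    b"echo hello",
    store_script=lambda name, data: f"s3://example-bucket/{name}",
    create_entry=rows.append,
)
```

Writing the script before the metadata row means a failed upload never leaves a row that points at a missing script.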