Requirements


Functional Requirements:

  • Execute uploaded tasks at their scheduled time
  • Users can upload jobs (one-time or CRON)



Non-Functional Requirements:

  • Low latency between a job's scheduled time and the start of its execution
  • Highly available, so that jobs run at their specified time


Estimations:

  • 864M jobs scheduled each day (10,000 per second × 86,400 seconds)
    • 60% are one-time jobs (~518M)
    • 40% are CRON jobs (~346M)
  • 10,000 jobs uploaded per second
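
The arithmetic behind these estimates can be checked with a few lines of Python. This is only a back-of-the-envelope sketch; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

uploads_per_second = 10_000
jobs_per_day = uploads_per_second * SECONDS_PER_DAY

# Split by job type, using the 60/40 ratio from the estimates above.
one_time_jobs = jobs_per_day * 60 // 100
cron_jobs = jobs_per_day * 40 // 100

print(f"{jobs_per_day:,} jobs per day")
print(f"{one_time_jobs:,} one-time jobs")
print(f"{cron_jobs:,} CRON jobs")
```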


Read-heavy, because the job runner must repeatedly check which jobs are due, so reads need a scaling plan (see High-Level Design).

Write throughput is also high, so writes need a scaling plan as well (see High-Level Design).




API Design


Core Entities:


JobMetaData

  • Job Name
  • StoredLocation
  • CRON
  • ScheduledTime
  • Status
  • Priority
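
One possible way to model this entity in code is sketched below. The field types, the default values, and the status values in JobStatus are assumptions made for illustration; the notes only name the fields.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class JobStatus(Enum):
    # Assumed status values; the notes do not enumerate them.
    SCHEDULED = "scheduled"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class JobMetaData:
    job_name: str
    stored_location: str          # where the script lives, e.g. an S3 URL
    scheduled_time: datetime      # next time the job should run
    cron: Optional[str] = None    # CRON expression; None for one-time jobs
    status: JobStatus = JobStatus.SCHEDULED
    priority: int = 0

    @property
    def is_recurring(self) -> bool:
        """A job is treated as recurring when it carries a CRON expression."""
        return self.cron is not None


# Hypothetical example row.
nightly = JobMetaData(
    job_name="nightly-report",
    stored_location="s3://example-bucket/scripts/nightly-report",
    scheduled_time=datetime(2024, 1, 15, 2, 0, tzinfo=timezone.utc),
    cron="0 2 * * *",
)
```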


POST /Task/Upload

{
  TaskName,
  TaskTime,
  Task
}
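
As a rough illustration of the request body, the sketch below checks that the three fields are present before the upload is processed. The helper name missing_fields and the example values are hypothetical, and the format of TaskTime (an ISO 8601 timestamp here) is an assumption, since the notes do not specify it.

```python
REQUIRED_FIELDS = ("TaskName", "TaskTime", "Task")


def missing_fields(body: dict) -> list[str]:
    """Return the required fields that are absent or empty in the request body."""
    return [name for name in REQUIRED_FIELDS if not body.get(name)]


# Hypothetical request body for POST /Task/Upload.
example_body = {
    "TaskName": "nightly-report",
    "TaskTime": "2024-01-15T02:00:00Z",   # assumed timestamp format
    "Task": "print('generating report')",  # the script to run
}

print(missing_fields(example_body))
print(missing_fields({"TaskName": "nightly-report"}))
```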



High-Level Design


The main table is the JobMetaData table, stored in Postgres because the data is relational rows. The table will be partitioned by time interval to avoid overwhelming a single DB.
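
To make the time-interval partitioning concrete, the sketch below maps a job's scheduled time to a partition name. The interval width (one hour by default) and the naming scheme job_metadata_YYYYMMDD_HH are assumptions; the notes only say that partitioning is by time interval.

```python
from datetime import datetime, timezone


def partition_name(scheduled_time: datetime, interval_hours: int = 1) -> str:
    """Map a scheduled time to the (hypothetical) partition that holds it."""
    ts = scheduled_time.astimezone(timezone.utc)
    # Round the hour down to the start of its interval bucket.
    bucket_hour = ts.hour - ts.hour % interval_hours
    return f"job_metadata_{ts:%Y%m%d}_{bucket_hour:02d}"


due = datetime(2024, 1, 15, 13, 45, tzinfo=timezone.utc)
print(partition_name(due))                    # hourly buckets
print(partition_name(due, interval_hours=6))  # six-hour buckets
```

With a scheme like this, a job runner looking for jobs due in the next few seconds only has to read the partition for the current interval rather than the whole table.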


Asynchronous leader-follower replication, set up with a quorum

  • Reasoning: in most cases it is acceptable if job execution starts a couple of seconds late, but when the script is executed its stored location (URL) must be consistent and correct. Retries provide a further safety net.


Retries:

The job runner checks for jobs that are due within a buffer window of, say, 10 seconds -> places each due job on the priority queue

-> If a job fails, the runner updates the DB with the retry count and increases the priority of that job

  • 3 retries max
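
The polling and retry behaviour described above could look roughly like the following sketch, which uses an in-memory heap in place of the real priority queue and skips the DB update. The names QueuedJob, enqueue_due_jobs and handle_failure are hypothetical, and treating a lower number as a higher priority is an assumption.

```python
import heapq
from dataclasses import dataclass, field

MAX_RETRIES = 3
LOOKAHEAD_SECONDS = 10  # buffer window for picking up jobs slightly early


@dataclass(order=True)
class QueuedJob:
    priority: int          # lower number = runs sooner (assumption)
    scheduled_time: float  # Unix timestamp, used as a tie-breaker
    name: str = field(compare=False)
    retry_count: int = field(default=0, compare=False)


def enqueue_due_jobs(jobs, queue, now):
    """Push every job that is due within the buffer window onto the queue."""
    for job in jobs:
        if job.scheduled_time <= now + LOOKAHEAD_SECONDS:
            heapq.heappush(queue, job)


def handle_failure(job, queue):
    """Requeue a failed job with higher priority, up to MAX_RETRIES times."""
    if job.retry_count >= MAX_RETRIES:
        return False  # retries exhausted; the job stays failed
    job.retry_count += 1
    job.priority -= 1  # bump priority so the retry runs ahead of new work
    heapq.heappush(queue, job)
    return True


queue = []
pending = [QueuedJob(5, 100.0, "report"), QueuedJob(5, 200.0, "cleanup")]
enqueue_due_jobs(pending, queue, now=95.0)  # only "report" is within 10 s
```

In the real design the retry count and the new priority would be written back to the JobMetaData row rather than kept only in memory.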


User uploads task -> Tries to store script in S3 -> If successful, generate S3 location URL -> Create entry in JobMetaData
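
The upload flow above can be sketched as follows. To keep the example self-contained, a small in-memory class stands in for S3 and a plain list stands in for the JobMetaData table; both, along with the bucket name, the key layout and the function name upload_task, are hypothetical. The point illustrated is the ordering: the metadata row is created only after the script has been stored successfully.

```python
class InMemoryObjectStore:
    """Hypothetical stand-in for S3 that keeps objects in a dict."""

    def __init__(self, bucket):
        self.bucket = bucket
        self.objects = {}

    def put(self, key, body):
        self.objects[key] = body
        return f"s3://{self.bucket}/{key}"


class FailingObjectStore(InMemoryObjectStore):
    """Simulates a storage outage."""

    def put(self, key, body):
        raise OSError("object store unavailable")


def upload_task(task_name, task_time, script, store, metadata_table):
    """Store the script first; create the metadata row only on success."""
    key = f"scripts/{task_name}"
    try:
        location = store.put(key, script)
    except OSError:
        return None  # nothing was written to the metadata table
    row = {
        "JobName": task_name,
        "StoredLocation": location,
        "ScheduledTime": task_time,
        "Status": "scheduled",  # assumed initial status
    }
    metadata_table.append(row)
    return row


table = []
store = InMemoryObjectStore("example-bucket")
row = upload_task("nightly-report", "2024-01-15T02:00:00Z", b"print('hi')", store, table)
failed = upload_task("other", "2024-01-15T03:00:00Z", b"", FailingObjectStore("example-bucket"), table)
```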





Detailed Component Design