Requirements


Functional Requirements:

  • Execute uploaded tasks at their scheduled time
  • Users can upload one-time and CRON jobs



Non-Functional Requirements:

  • Low latency between a job's scheduled time and the start of its execution
  • Highly available, so that jobs run at their specified time


Estimations:

  • 864M jobs scheduled each day (10,000 per second × 86,400 seconds)
    • 60% are one-time jobs (~518M)
    • 40% are CRON jobs (~346M)
  • 10,000 jobs uploaded per second
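
The arithmetic behind these estimates can be checked with a short sketch. The function name and the integer-percent parameter are illustrative, not part of the design.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def daily_job_estimates(uploads_per_second: int, one_time_percent: int) -> dict:
    """Derive daily job counts from the upload rate and the one-time share."""
    total = uploads_per_second * SECONDS_PER_DAY
    # Integer arithmetic avoids floating-point rounding on large counts.
    one_time = total * one_time_percent // 100
    return {"total": total, "one_time": one_time, "cron": total - one_time}


estimates = daily_job_estimates(uploads_per_second=10_000, one_time_percent=60)
print(estimates)
```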


Read-heavy, because the system constantly checks which jobs need to run, so reads need a scaling plan (see the High-Level Design section).

Write throughput is also high, so writes need a scaling plan as well (see the High-Level Design section).




API Design


Core Entities:


JobMetaData

  • JobName
  • StoredLocation
  • CRON
  • ScheduledTime
  • Status
  • Priority
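
One possible shape for this entity, as a minimal sketch. The field types, the Status values, and the three Priority levels are assumptions; the notes only name the fields.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum, IntEnum
from typing import Optional


class Status(Enum):
    # Hypothetical status values; the notes only list a Status field.
    SCHEDULED = "scheduled"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class Priority(IntEnum):
    # Assumed to mirror the low/med/high topics in the component design.
    LOW = 0
    MEDIUM = 1
    HIGH = 2


@dataclass
class JobMetaData:
    job_name: str
    stored_location: str             # where the uploaded script is stored
    scheduled_time: datetime         # next time the job should run
    cron: Optional[str] = None       # assumption: None means a one-time job
    status: Status = Status.SCHEDULED
    priority: Priority = Priority.LOW


job = JobMetaData(
    job_name="nightly-report",
    stored_location="s3://example-bucket/tasks/nightly-report",
    scheduled_time=datetime(2025, 1, 1, 2, 0, tzinfo=timezone.utc),
    cron="0 2 * * *",
)
```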


POST /Task/Upload

{
  TaskName,
  TaskTime,
  Task
}
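
A sketch of how a server might validate this request body. It assumes a JSON body and an ISO 8601 string for TaskTime; neither is specified above, and parse_upload_request is a hypothetical helper.

```python
import json
from datetime import datetime

REQUIRED_FIELDS = ("TaskName", "TaskTime", "Task")


def parse_upload_request(body: str) -> dict:
    """Validate a /Task/Upload body and return the parsed fields."""
    payload = json.loads(body)
    if not isinstance(payload, dict):
        raise ValueError("body must be a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in payload]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    # Assumption: TaskTime arrives as an ISO 8601 timestamp with an offset.
    payload["TaskTime"] = datetime.fromisoformat(payload["TaskTime"])
    return payload


example_body = json.dumps({
    "TaskName": "nightly-report",
    "TaskTime": "2025-01-01T02:00:00+00:00",
    "Task": "echo 'generate report'",
})
parsed = parse_upload_request(example_body)
```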



High-Level Design


The main table is JobMetaData, stored in Postgres since the rows are relational. It will be partitioned by time interval to avoid overwhelming a single DB.
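
To illustrate time-interval partitioning, the sketch below maps a scheduled time to a partition. The one-hour interval, the use of the scheduled time as the partition key, and the job_metadata_ naming scheme are all assumptions.

```python
from datetime import datetime, timedelta, timezone

PARTITION_INTERVAL = timedelta(hours=1)  # assumed bucket size
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def partition_start(scheduled_time: datetime,
                    interval: timedelta = PARTITION_INTERVAL) -> datetime:
    """Round a timezone-aware timestamp down to the start of its interval."""
    buckets = (scheduled_time - EPOCH) // interval
    return EPOCH + buckets * interval


def partition_name(scheduled_time: datetime) -> str:
    """Hypothetical naming scheme: one partition per hour."""
    start = partition_start(scheduled_time)
    return f"job_metadata_{start:%Y%m%d_%H}"


name = partition_name(datetime(2025, 1, 1, 9, 42, 17, tzinfo=timezone.utc))
```

With this layout, a query for jobs due in the next few seconds only touches the current partition, or two when the window crosses an interval boundary.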


Asynchronous leader-follower replication, set up with a quorum

  • Reasoning: in most cases it is acceptable for job execution to start a couple of seconds late, but the script's URL location must be consistent and correct when the job executes, and there are retries as well


Retries:

The job runner checks for jobs that are due within a buffer window of, say, 10 seconds -> places each job on the priority queue

-> If a job fails, it updates the DB with the retry count and increases the priority of that job

  • 3 retries max
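
The retry rule can be sketched as below. It assumes "3 retries max" means three retries after the first attempt and that priority is capped at the highest level; JobState and record_failure are hypothetical names.

```python
from dataclasses import dataclass

MAX_RETRIES = 3
LOW, MEDIUM, HIGH = 0, 1, 2


@dataclass
class JobState:
    name: str
    priority: int = LOW
    retry_count: int = 0
    status: str = "scheduled"


def record_failure(job: JobState) -> bool:
    """Update a job after a failed run; return True if it should be re-enqueued."""
    if job.retry_count >= MAX_RETRIES:
        job.status = "failed"  # retries exhausted, give up
        return False
    job.retry_count += 1
    job.priority = min(job.priority + 1, HIGH)  # bump priority, capped at HIGH
    job.status = "scheduled"
    return True


job = JobState(name="nightly-report")
outcomes = [record_failure(job) for _ in range(4)]
```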


User uploads a task -> The service tries to store the script in S3 -> If successful, it generates the S3 location URL -> It creates an entry in JobMetaData
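
The ordering in this flow (store the script first, write metadata only on success) can be sketched with in-memory stand-ins. InMemoryObjectStore is a hypothetical substitute for S3, and the dict is a substitute for the JobMetaData table; neither reflects a real client API.

```python
from typing import Dict


class InMemoryObjectStore:
    """Stand-in for S3; put() returns the location of the stored object."""

    def __init__(self, bucket: str, fail: bool = False):
        self.bucket = bucket
        self.fail = fail
        self.objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> str:
        if self.fail:
            raise IOError("object store unavailable")
        self.objects[key] = data
        return f"s3://{self.bucket}/{key}"


def upload_task(store: InMemoryObjectStore, metadata_table: Dict[str, dict],
                task_name: str, task_time: str, script: bytes) -> bool:
    """Store the script, then create the metadata row only if storage succeeded."""
    try:
        location = store.put(f"tasks/{task_name}", script)
    except IOError:
        return False  # no metadata row is written for a failed upload
    metadata_table[task_name] = {
        "StoredLocation": location,
        "ScheduledTime": task_time,
        "Status": "scheduled",
    }
    return True


table: Dict[str, dict] = {}
ok = upload_task(InMemoryObjectStore("example-bucket"), table,
                 "nightly-report", "2025-01-01T02:00:00+00:00", b"echo hi")
```

Writing the metadata row last means a row never points at a script that was not stored.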





Detailed Component Design


A dispatcher service constantly queries the partitions and enqueues due jobs onto a Kafka topic chosen by priority (low, med, high topics). Each time a job fails, it is put on a higher-priority queue.
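
A minimal sketch of the dispatcher's routing step. The topic names are hypothetical, the 10-second look-ahead comes from the retry notes, and produce is a plain callable standing in for a Kafka producer rather than a real client.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, List, Tuple

TOPIC_BY_PRIORITY = {0: "jobs-low", 1: "jobs-med", 2: "jobs-high"}  # hypothetical names
LOOKAHEAD = timedelta(seconds=10)


def dispatch_due_jobs(jobs: Iterable[dict], now: datetime,
                      produce: Callable[[str, str], None]) -> int:
    """Enqueue every job due within the look-ahead window; return how many were sent."""
    sent = 0
    for job in jobs:
        if job["scheduled_time"] <= now + LOOKAHEAD:
            produce(TOPIC_BY_PRIORITY[job["priority"]], job["name"])
            sent += 1
    return sent


now = datetime(2025, 1, 1, 2, 0, 0, tzinfo=timezone.utc)
jobs = [
    {"name": "due-now", "priority": 0, "scheduled_time": now},
    {"name": "due-soon", "priority": 2, "scheduled_time": now + timedelta(seconds=5)},
    {"name": "later", "priority": 1, "scheduled_time": now + timedelta(minutes=5)},
]
sent_messages: List[Tuple[str, str]] = []
count = dispatch_due_jobs(jobs, now, lambda topic, name: sent_messages.append((topic, name)))
```

A real dispatcher would also need to mark a job as enqueued so that the next polling pass does not send it twice; that bookkeeping is left out here.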


Consumer nodes consume jobs from the queue and execute them.
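
One way a consumer could pick its next job, as a sketch. Draining the high topic before med and low is an assumption (the notes do not say how consumers order the topics), and the deques stand in for Kafka topics with hypothetical names.

```python
from collections import deque
from typing import Callable, Dict, List, Optional

TOPIC_ORDER = ("jobs-high", "jobs-med", "jobs-low")  # assumed strict priority order


def next_job(queues: Dict[str, deque]) -> Optional[str]:
    """Return the next job name, preferring higher-priority topics."""
    for topic in TOPIC_ORDER:
        if queues[topic]:
            return queues[topic].popleft()
    return None


def run_consumer(queues: Dict[str, deque], execute: Callable[[str], bool]) -> List[str]:
    """Execute jobs until all queues are empty; return the names of failed jobs."""
    failed = []
    job = next_job(queues)
    while job is not None:
        if not execute(job):
            failed.append(job)  # the caller would apply the retry rule here
        job = next_job(queues)
    return failed


queues = {
    "jobs-high": deque(["retry-me"]),
    "jobs-med": deque(["report"]),
    "jobs-low": deque(["cleanup", "flaky"]),
}
executed: List[str] = []


def execute(job_name: str) -> bool:
    executed.append(job_name)
    return job_name != "flaky"  # simulate one failing job


failed_jobs = run_consumer(queues, execute)
```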