System requirements


Functional:

  • creating tasks:
    • users can create tasks with correct properties
      • include an expiry length
  • two task scheduling types:
    1. once at a specific time
    2. repeating (on an interval)
  • task management:
    • users can update tasks
    • users can delete tasks
  • task execution:
    • system executes tasks at the specified time
  • view success/failure of tasks:
    • users can view if past executions of tasks were successful
    • users can view the text output and logs from successful and failed tasks (output and logs expire after a retention period)
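The two scheduling types above can be sketched as a single "next run" calculation. This is a minimal sketch, not a definitive implementation: the Schedule class, its field names, and the next_run function are hypothetical, since the notes only name the two types (once, or repeating on an interval).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Schedule:
    # Hypothetical shape; the notes only name the two scheduling types.
    first_run: datetime
    interval: Optional[timedelta] = None  # None means "run once"


def next_run(schedule: Schedule, now: datetime) -> Optional[datetime]:
    """Return the next due time strictly after `now`, or None if no run remains."""
    if now < schedule.first_run:
        return schedule.first_run
    if schedule.interval is None:
        return None  # one-time task whose time has already passed
    if schedule.interval <= timedelta(0):
        raise ValueError("interval must be positive")
    # Skip over every whole interval that has already elapsed.
    elapsed = now - schedule.first_run
    periods = elapsed // schedule.interval + 1
    return schedule.first_run + periods * schedule.interval


if __name__ == "__main__":
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    daily = Schedule(first_run=start, interval=timedelta(days=1))
    print(next_run(daily, start + timedelta(days=2, hours=12)))
```

Computing the next run from the first run plus whole intervals (rather than "last run + interval") keeps a repeating task from drifting when one execution starts late.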


Non-Functional:

  • high reliability
  • manage thousands of tasks
  • security:
    • user can view and edit all tasks within their security group
  • tasks can be replicated globally




Capacity estimation

  • Need to be able to store and execute thousands of tasks
  • 100 creations / day -> about 1 every 14 minutes, well under 1 per minute (very light)
    • latency: 10 seconds
  • execution of tasks:
    • average: 1000 per day
    • peaks at local times:
      • late night peak 300 at midnight in PST
      • early morning peak 100 at 5 AM
      • super peak: 00:00 Monday to gather week data - 750
    • latency:
      • 30 seconds to start tasks - allows for jitter
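The estimates above can be checked with a few lines of arithmetic. This is a rough sketch under one loud assumption that is not in the notes: all 750 super-peak tasks come due at the same instant and each must start within the 30-second window. The variable names are illustrative only.

```python
MINUTES_PER_DAY = 24 * 60

creations_per_day = 100
executions_per_day = 1000
super_peak_tasks = 750       # 00:00 Monday peak from the notes
start_latency_seconds = 30   # allowed delay before a task must start

# Average rates over a full day.
creations_per_minute = creations_per_day / MINUTES_PER_DAY
minutes_between_creations = MINUTES_PER_DAY / creations_per_day
avg_executions_per_minute = executions_per_day / MINUTES_PER_DAY

# Assumption: the whole super peak is due at once and must start in the window.
peak_starts_per_second = super_peak_tasks / start_latency_seconds

print(f"creations per minute:        {creations_per_minute:.3f}")
print(f"minutes between creations:   {minutes_between_creations:.1f}")
print(f"avg executions per minute:   {avg_executions_per_minute:.3f}")
print(f"peak task starts per second: {peak_starts_per_second:.1f}")
```

Under that assumption the executors need to sustain roughly 25 task starts per second at the Monday peak, which is far above the daily average and is the number that should size the execution fleet.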





API design

CRUD:

  • Create: create a new task
    • Endpoint:
      • POST /tasks/
    • Request Parameters:
      • name: string, required
      • executionTime: datetime
      • script: string, required (e.g. "/usr/bin/cleanupold --old ")
      • security_group: string, required
      • userid: string, required
    • Response Parameters:
      • identifier of the new task
    • Response Codes:
      • 200 - ok
      • 401 - unauthorized user
      • 429 - throttled
  • Read: gets all tasks for a security group
    • Endpoint:
      • GET /tasks/
    • Parameters:
      • security_group: string, required
    • Response Parameters:
      • list of task objects with parameters:
        • name: string, required
        • executionTime: datetime
        • script: string, required (e.g. "/usr/bin/cleanupold --old ")
        • security_group: string, required
        • userid: string, required
  • Update: update tasks
    • Endpoint: POST /tasks/
      • ...
  • Delete
    • ...
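A minimal sketch of how the Create request could be validated before a task is stored. Several things here are assumptions rather than part of the notes: the validate_create function and CreateTaskRequest class are hypothetical, the 400 status for malformed input is not in the listed response codes (200, 401, 429), and executionTime is assumed to arrive as an ISO 8601 string.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass
class CreateTaskRequest:
    name: str
    script: str
    security_group: str
    userid: str
    # Not marked "required" in the notes, so it is optional here.
    execution_time: Optional[datetime] = None


REQUIRED_FIELDS = ("name", "script", "security_group", "userid")


def validate_create(payload: dict) -> Tuple[int, object]:
    """Return (status_code, body). 400 is an assumed code, not one listed in the notes."""
    missing = [field for field in REQUIRED_FIELDS if not payload.get(field)]
    if missing:
        return 400, {"error": "missing fields: " + ", ".join(missing)}

    raw_time = payload.get("executionTime")
    try:
        when = datetime.fromisoformat(raw_time) if raw_time else None
    except (TypeError, ValueError):
        return 400, {"error": "executionTime must be an ISO 8601 string"}

    return 200, CreateTaskRequest(
        name=payload["name"],
        script=payload["script"],
        security_group=payload["security_group"],
        userid=payload["userid"],
        execution_time=when,
    )
```

Authentication (401) and throttling (429) would sit in front of this check, for example at the gateway, so they are left out of the sketch.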






Database design



  • task: NoSQL database
    • name: string, required
    • executionTime: datetime
    • script: string, required (e.g. "/usr/bin/cleanupold --old ")
    • security_group: string, required
    • userid: string, required
  • Recently Run tasks: NoSQL database
    • task id
    • task name
    • execution time
    • S3 links to uploaded data
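A sketch of what one "recently run tasks" record could look like, with an expiry value attached so old history can age out. The build_run_record function, the attribute names, and the expires_at field are hypothetical. The default retention of roughly 5 years comes from the "expire histories over 5 years" note and is approximated as 5 * 365 days. The expiry is stored as epoch seconds because that is the format DynamoDB's TTL feature reads from a Number attribute; no database call is made here.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable

# Assumed retention, approximating the "5 years" mentioned for histories.
DEFAULT_RETENTION = timedelta(days=5 * 365)


def build_run_record(
    task_id: str,
    task_name: str,
    executed_at: datetime,
    s3_links: Iterable[str],
    retention: timedelta = DEFAULT_RETENTION,
) -> Dict[str, object]:
    """Build a plain dict for one past execution, including an expiry timestamp."""
    expires_at = executed_at + retention
    return {
        "task_id": task_id,
        "task_name": task_name,
        "execution_time": executed_at.isoformat(),
        "s3_links": list(s3_links),
        # Hypothetical attribute name; value is epoch seconds.
        "expires_at": int(expires_at.timestamp()),
    }


if __name__ == "__main__":
    ran_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
    # The S3 URL below is a made-up placeholder.
    print(build_run_record("t1", "cleanup", ran_at, ["s3://example-bucket/t1/output.log"]))
```

Storing only links in the record keeps each item small; the output itself stays in S3 until someone asks for it.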



High-level design








Request flows







Detailed component design







Trade offs/Tech choices



  • log store: S3
    • fast
    • reliable
    • cheap
    • security enforcement
    • has automatic expiration (example: after 1 year)
  • task database: NoSQL database such as DynamoDB
    • fast look-up
    • expiration
    • not concerned about immediate replication
  • load balancer:
    • API gateway
      • versioning
  • execution queue:
    • SQS
      • how to handle failures:
        • add these to a DLQ
        • another server can upload an error message about the failure to S3
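The failure handling above can be sketched with an in-memory stand-in for the queue. This is not the SQS API: the drain function and MAX_ATTEMPTS value are hypothetical, and a plain deque plays the role of the execution queue. In SQS itself the equivalent behaviour is configured on the queue with a redrive policy rather than written by hand.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

MAX_ATTEMPTS = 3  # assumed retry limit; the notes do not give one


def drain(
    queue: Deque[str],
    run: Callable[[str], None],
    max_attempts: int = MAX_ATTEMPTS,
) -> List[Tuple[str, str]]:
    """Run every queued task; return (task_id, error) pairs that ended up dead-lettered."""
    attempts: Dict[str, int] = {}
    dead_letters: List[Tuple[str, str]] = []
    while queue:
        task_id = queue.popleft()
        attempts[task_id] = attempts.get(task_id, 0) + 1
        try:
            run(task_id)
        except Exception as exc:  # any failure counts as one failed attempt
            if attempts[task_id] >= max_attempts:
                # Out of retries: park the task and keep the error text,
                # which a separate worker could later upload to S3.
                dead_letters.append((task_id, str(exc)))
            else:
                queue.append(task_id)  # retry later, behind other work
    return dead_letters
```

Keeping the error text alongside the dead-lettered task is what lets a separate server publish a failure message without re-running the task.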





Failure scenarios/bottlenecks



  • long running tasks overlapping with next run:
    • add an expiry
    • add an in-progress status:
      • can't start task if previous run is still going
  • database outages:
    • prevent by using multiple replicas (DynamoDB will take care of this)
  • queue being overloaded:
    • scale up our number of execution servers
  • very large histories:
    • expire histories over 5 years
    • paginate requests for histories
    • do not download anything from S3 until requested
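The first scenario in the list (a long-running task overlapping its next run) combines an in-progress status with an expiry. A minimal single-process sketch of that idea follows. The RunGuard class and its method names are hypothetical, and with several execution servers this state would have to live in shared storage and be updated atomically, which the sketch does not attempt.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict


class RunGuard:
    """Tracks in-progress runs so a task cannot start while its previous run is live."""

    def __init__(self, max_runtime: timedelta) -> None:
        # After max_runtime an in-progress marker is treated as expired,
        # so a crashed run cannot block the task forever.
        self.max_runtime = max_runtime
        self._started: Dict[str, datetime] = {}

    def try_start(self, task_id: str, now: datetime) -> bool:
        started = self._started.get(task_id)
        if started is not None and now - started < self.max_runtime:
            return False  # previous run is still going and has not expired
        self._started[task_id] = now
        return True

    def finish(self, task_id: str) -> None:
        self._started.pop(task_id, None)


if __name__ == "__main__":
    guard = RunGuard(max_runtime=timedelta(minutes=10))
    t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(guard.try_start("weekly-report", t0))
    print(guard.try_start("weekly-report", t0 + timedelta(minutes=5)))
```

The expiry matters as much as the status flag: without it, a run that dies before calling finish would leave the task permanently marked in progress.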



Future improvements
