System requirements
Functional:
- creating tasks:
- users can create tasks with correct properties
- include an expiry length
- Two task scheduling types:
- once at a specific time
- repeating (on an interval)
- task management:
- users can update tasks
- users can delete tasks
- task execution:
- system executes tasks at the specified time
- view success/failure of tasks:
- users can view if past executions of tasks were successful
- users can view the text output and logs from successful and failed tasks (output and logs expire)
Non-Functional:
- high reliability
- manage thousands of tasks
- security:
- users can view and edit all tasks within their security group
- tasks can be replicated globally
Capacity estimation
- Need to be able to store and execute thousands of tasks
- 100 creations / day -> about one every 14 minutes, well under 1 per minute (very light)
- latency: 10 seconds
- execution of tasks:
- average: 1000 per day
- peaks at local times:
- late night peak 300 at midnight in PST
- early morning peak 100 at 5 AM
- super peak: 00:00 Monday to gather week data - 750
- latency:
- 30 seconds to start tasks - allows for jitter
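The arithmetic behind these estimates can be checked with a short script. This is only a sketch: the variable names are made up for illustration, and it assumes the worst case where all 750 super-peak tasks come due at the same instant and must all start within the 30-second budget.

```python
MINUTES_PER_DAY = 24 * 60

creations_per_day = 100
executions_per_day = 1000
super_peak_tasks = 750        # 00:00 Monday burst (assumed to be due simultaneously)
start_latency_budget_s = 30   # allowed delay before a task must start

# Average gap between task creations, in minutes.
minutes_between_creations = MINUTES_PER_DAY / creations_per_day

# Average execution rate over a whole day.
avg_executions_per_minute = executions_per_day / MINUTES_PER_DAY

# Start rate needed to launch the whole super peak inside the latency budget.
peak_starts_per_second = super_peak_tasks / start_latency_budget_s

print(f"one creation every {minutes_between_creations:.1f} minutes")
print(f"{avg_executions_per_minute:.2f} executions per minute on average")
print(f"{peak_starts_per_second:.0f} task starts per second at the super peak")
```

The average load is tiny, so the execution fleet is sized by the peak start rate rather than by the daily total.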
API design
CRUD:
- Create: create a new task
- Endpoint:
- POST /tasks/
- Request Parameters:
- name: string, required
- executionTime: datetime
- script: string, required (e.g. "/usr/bin/cleanupold --old")
- security_group: string, required
- userid: string, required
- Response Parameters:
- identifier of the new task
- Response Codes:
- 200 - ok
- 401 - unauthorized user
- 429 - throttled
- Read: gets all tasks for a security group
- Endpoint:
- GET /tasks/
- Parameters:
- security_group: string, required
- Response Parameters:
- list of task objects with parameters:
- name: string, required
- executionTime: datetime
- script: string, required (e.g. "/usr/bin/cleanupold --old")
- security_group: string, required
- userid: string, required
- Update: update tasks
- Endpoint: POST /tasks/
- ...
- Delete
- ...
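As a rough illustration of the Create endpoint, the handler below checks the request parameters and maps outcomes to the response codes listed for Create. It is a sketch, not the service's actual code: `create_task`, `validate_create_request`, the `user_groups` argument, the `throttled` flag, and the 400 response for malformed input are all assumptions added for the example.

```python
from datetime import datetime
import uuid

REQUIRED_FIELDS = ("name", "script", "security_group", "userid")


def validate_create_request(body: dict) -> list[str]:
    """Return a list of problems; an empty list means the request is valid."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if not body.get(name)
    ]
    execution_time = body.get("executionTime")
    if execution_time is not None:
        try:
            datetime.fromisoformat(execution_time)
        except (TypeError, ValueError):
            problems.append("executionTime must be an ISO 8601 datetime string")
    return problems


def create_task(body: dict, user_groups: set, throttled: bool = False):
    """Hypothetical POST /tasks/ handler; returns (status_code, payload)."""
    if throttled:
        return 429, {"error": "throttled"}
    problems = validate_create_request(body)
    if problems:
        # Assumption: 400 is not in the listed response codes.
        return 400, {"errors": problems}
    if body["security_group"] not in user_groups:
        return 401, {"error": "unauthorized user"}
    # A real handler would persist the task here before returning its id.
    return 200, {"id": str(uuid.uuid4())}
```

A request for a security group the caller does not belong to is rejected with 401 before anything is stored.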
Database design
Defining the system data model early on will clarify how data will flow among different components of the system. Also you could draw an ER diagram using the diagramming tool to enhance your design...
- task: NoSQL database
- name: string, required
- executionTime: datetime
- script: string, required (e.g. "/usr/bin/cleanupold --old")
- security_group: string, required
- userid: string, required
- Recently Run tasks: NoSQL database
- task id
- task name
- execution time
- S3 links to uploaded data
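The two record types could be sketched as plain data classes. The class names are illustrative, and the `interval` field is an assumption: it is not in the schema above and is added only to show how the "repeating" scheduling type from the requirements might be represented.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Task:
    name: str
    script: str
    security_group: str
    userid: str
    execution_time: Optional[datetime] = None
    # Assumption: not part of the schema above; models the repeating type.
    interval: Optional[timedelta] = None

    def next_run_after(self, last_run: datetime) -> Optional[datetime]:
        """Next due time after a completed run, or None for one-off tasks."""
        return last_run + self.interval if self.interval else None


@dataclass
class TaskRun:
    task_id: str
    task_name: str
    execution_time: datetime
    # Links to output and logs uploaded to S3 for this run.
    s3_links: list[str] = field(default_factory=list)
```

Storing only links in the run record keeps history items small; the output itself stays in S3 until someone asks for it.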
High-level design
You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design. If you are unfamiliar with the tool, you can simply describe your design to the chat bot and ask it to generate a starter diagram for you to modify...
Request flows
Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...
Detailed component design
Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...
Trade offs/Tech choices
Explain any trade offs you have made and why you made certain tech choices...
- log store: S3
- fast
- reliable
- cheap
- security enforcement
- have automatic expiration (example: after 1 year)
- task database: NoSQL database such as DynamoDB
- fast look-up
- expiration
- not concerned about immediate replication
- load balancer:
- API gateway
- versioning
- execution queue:
- SQS
- how to handle failures:
- add these to a DLQ
- another server can upload an error message about the failure to S3
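The failure handling for the execution queue can be pictured with an in-memory stand-in. This is a sketch of the idea only, not SQS itself: `drain`, the message dictionaries, and `MAX_RECEIVE_COUNT` are hypothetical, loosely modelled on the receive-count limit that an SQS redrive policy applies before moving a message to a DLQ.

```python
from collections import deque

# Assumption: illustrative limit, similar in spirit to an SQS redrive policy.
MAX_RECEIVE_COUNT = 3


def drain(queue: deque, dead_letters: list, run_task) -> None:
    """Process messages until the queue is empty.

    A message whose task keeps failing is retried until it has been
    received MAX_RECEIVE_COUNT times, then parked in dead_letters.
    """
    while queue:
        message = queue.popleft()
        message["receive_count"] += 1
        try:
            run_task(message["task_id"])
        except Exception as exc:
            # Keep the failure reason so another worker can report it.
            message["last_error"] = str(exc)
            if message["receive_count"] >= MAX_RECEIVE_COUNT:
                dead_letters.append(message)
            else:
                queue.append(message)  # put it back for a later retry
```

A separate consumer of the dead-letter list would be the component that uploads the stored error message to S3.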
Failure scenarios/bottlenecks
Try to discuss as many failure scenarios/bottlenecks as possible.
- long running tasks overlapping with next run:
- add an expiry
- add an in-progress status:
- can't start task if previous run is still going
- database outages:
- prevent by using multiple replicas (DynamoDB will take care of this)
- queue being overloaded:
- scale up our number of execution servers
- very large histories:
- expire histories older than 5 years
- paginate requests for histories
- do not download anything from S3 until requested
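The overlap guard (in-progress status plus expiry) might look like the sketch below. It uses an in-memory dictionary and hypothetical `try_start` and `finish` helpers; in a real deployment the check-and-set would need to be atomic, for example a conditional write in the task database.

```python
from datetime import datetime, timedelta


def try_start(state: dict, task_id: str, now: datetime, expiry: timedelta) -> bool:
    """Mark the task in progress and return True if no live run exists.

    A previous run blocks a new one only while it is younger than expiry,
    so a hung run cannot block the task forever.
    """
    started_at = state.get(task_id)
    if started_at is not None and now - started_at < expiry:
        return False      # previous run is still going
    state[task_id] = now  # task was idle, or the old run has expired
    return True


def finish(state: dict, task_id: str) -> None:
    """Clear the in-progress marker once a run completes."""
    state.pop(task_id, None)
```

With a 10-minute expiry, a run started at 00:00 blocks a second start at 00:05 but not one at 00:10.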
Future improvements
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?