System requirements
Functional:
List functional requirements for the system (Ask the chat bot for hints if stuck.)...
- Allow users to create tasks
- Allow users to create tasks with a particular frequency (monthly, biweekly, weekly)
- Notify the users once the task is executed
- Notifications can be sent via email, push notification, or in-app message.
Non-Functional:
List non-functional requirements for the system...
- System should have low latency.
- System should be available for the users to create tasks.
- System should be reliable and accurate.
- Tasks should be executed at their scheduled times without delays or errors.
- There should not be any silent failures.
- System should send notifications to the user on failures.
- System should recover from any failures and notify the users if the task cannot be executed on time.
Capacity estimation
Estimate the scale of the system you are going to design...
- The number of registered users could be around 500 million.
- Daily active users could be around 100K.
- Each task's metadata could be around 500 MB.
- Assume each active user creates around 2 tasks per day, which implies 2 * 100K * 500 MB = 200,000 * 500 MB = 100 TB = 0.1 PB of storage per day.
- 100K users creating tasks spread evenly over 24 hours is about 69 users per minute (100K / 1,440), or about 1.2 per second. At 2 tasks per user, that is about 2.3 task creations per second.
- If creating a task takes around 30 ms, a single worker handling requests one at a time completes about 33 tasks per second (1,000 ms / 30 ms). Reaching 2K tasks per second on one pod therefore needs about 60 concurrent workers.
- If there are 100 pods and each handles one request at a time, the system can create 100 tasks concurrently.
- One pod can support up to 2,000 requests per second, so during burst traffic 100 pods can support at most 2,000 * 100 = 200K requests per second. A burst of up to 200K requests in one second can be handled by the system; anything beyond that will be throttled.
- We need to enable autoscaling. Autoscaling can be based on a target CPU utilization for the pods, say 45%. When pod CPU utilization reaches 45%, the system starts scaling out. This helps avoid throttling and handle burst traffic.
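The arithmetic behind these estimates can be checked with a short script. This is only a sketch: the input figures are the assumptions listed above, the variable names are illustrative, and decimal units are assumed (1 PB = 10^9 MB).

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Inputs taken from the estimates above.
daily_active_users = 100_000
tasks_per_user_per_day = 2
task_metadata_mb = 500
request_latency_ms = 30
pod_capacity_rps = 2_000
pod_count = 100

# Daily volume and storage (decimal units: 1 PB = 1e9 MB).
tasks_per_day = daily_active_users * tasks_per_user_per_day
daily_storage_pb = tasks_per_day * task_metadata_mb / 1_000_000_000

# Average creation rate if traffic is spread evenly over the day.
avg_tasks_per_sec = tasks_per_day / SECONDS_PER_DAY

# Concurrent workers one pod needs to sustain its rated throughput
# when every request takes request_latency_ms.
workers_per_pod = math.ceil(pod_capacity_rps * request_latency_ms / 1000)

# Ceiling on burst traffic before throttling (no autoscaling).
burst_capacity_rps = pod_capacity_rps * pod_count

print(f"tasks/day: {tasks_per_day:,}")
print(f"storage/day: {daily_storage_pb} PB")
print(f"average rate: {avg_tasks_per_sec:.2f} tasks/sec")
print(f"workers per pod: {workers_per_pod}")
print(f"burst capacity: {burst_capacity_rps:,} requests/sec")
```

The average rate is far below the burst capacity, so the pod count is driven by peak traffic rather than by the daily average.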
API design
Define what APIs are expected from the system...
POST - v1/user/createtask
{
user_id:
frequency: {monthly | weekly | biweekly | yearly}
start_date: {start date of the task}
end_date: {end date of the task}
}
{
task_id:{task_id}
}
200 OK response
Create a task for a user and return the task id.
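A minimal sketch of how the create-task request body might be validated before a task is stored. The class name `CreateTaskRequest`, the `validate` method, and the error messages are hypothetical; only the field names and the frequency values come from the request shape above.

```python
from dataclasses import dataclass
from datetime import date

# Frequency values listed in the request body above.
ALLOWED_FREQUENCIES = {"weekly", "biweekly", "monthly", "yearly"}


@dataclass(frozen=True)
class CreateTaskRequest:
    """Hypothetical in-memory form of the POST body."""
    user_id: str
    frequency: str
    start_date: date
    end_date: date

    def validate(self) -> list:
        """Return a list of validation errors; empty means the request is valid."""
        errors = []
        if not self.user_id:
            errors.append("user_id is required")
        if self.frequency not in ALLOWED_FREQUENCIES:
            errors.append(f"unsupported frequency: {self.frequency}")
        if self.end_date < self.start_date:
            errors.append("end_date must not be before start_date")
        return errors


request = CreateTaskRequest("user-1", "weekly", date(2024, 1, 1), date(2024, 6, 30))
print(request.validate())
```

A request that fails validation would typically be rejected with a 4xx response instead of the 200 OK shown above, though the status codes for error cases are not specified in this design.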
DELETE - v1/user/task
{
task_id : {task_id}
}
{
200 OK
}
Delete the task with the given task id.
PATCH - v1/user/task
{
task_id : {task_id}
end_date:
start_date:
frequency:
}
{
200 OK
}
Modify the parameters of the task.
Database design
Defining the system data model early on will clarify how data will flow among different components of the system. Also you could draw an ER diagram using the diagramming tool to enhance your design...
Choose SQL database
User is one entity
Task is another entity
user_id is PK in user table
task_id is PK in task table
user_id is Foreign key in task table
User Table
user_id - UUID data type
first_name - string
last_name - string
creation_time - long
updation_time - long (initially the same as creation_time)
Task table
task_id - UUID data type
user_id - UUID
frequency - string
start_time - long
end_time - long
creation_time - long
updation_time - long
Task execution table
task_id - UUID
user_id - UUID
execution_time - long
frequency - string
end_time - long
creation_time - long
updation_time - long
Task table -> When a task is created, the task and its metadata are written into this table.
Task execution table -> Every execution of a task is written into this table.
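The three tables could be sketched as follows, using Python's built-in sqlite3 module purely for illustration. Several details are assumptions rather than part of the design above: the plural table names, UUIDs stored as text, times stored as epoch integers, and the composite primary key on the execution table.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    user_id        TEXT PRIMARY KEY,      -- UUID stored as text (assumption)
    first_name     TEXT NOT NULL,
    last_name      TEXT NOT NULL,
    creation_time  INTEGER NOT NULL,      -- epoch time (assumption)
    updation_time  INTEGER NOT NULL
);

CREATE TABLE tasks (
    task_id        TEXT PRIMARY KEY,
    user_id        TEXT NOT NULL REFERENCES users(user_id),
    frequency      TEXT NOT NULL,
    start_time     INTEGER NOT NULL,
    end_time       INTEGER NOT NULL,
    creation_time  INTEGER NOT NULL,
    updation_time  INTEGER NOT NULL
);

CREATE TABLE task_executions (
    task_id        TEXT NOT NULL REFERENCES tasks(task_id),
    user_id        TEXT NOT NULL REFERENCES users(user_id),
    execution_time INTEGER NOT NULL,
    frequency      TEXT NOT NULL,
    end_time       INTEGER NOT NULL,
    creation_time  INTEGER NOT NULL,
    updation_time  INTEGER NOT NULL,
    -- Assumption: one row per task per scheduled execution time.
    PRIMARY KEY (task_id, execution_time)
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the three tables and turn on foreign-key enforcement."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
```

With foreign keys enforced, inserting a task for a user that does not exist is rejected by the database rather than silently accepted.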
High-level design
You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design. If you are unfamiliar with the tool, you can simply describe your design to the chat bot and ask it to generate a starter diagram for you to modify...
Request flows
Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...
Detailed component design
Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...
- Kafka consumers pull tasks from the Kafka topic, which is partitioned on task_id.
- Kafka consumers push each task to an SQS queue.
- Kafka consumers retry up to 10 times if publishing the message to the SQS queue fails.
- After the retries are exhausted, the message is written to a DLQ.
- Messages are redriven from the DLQ so that consumers pick them up again.
- SQS is partitioned on execution_time.
- Once a message is published to SQS, SQS consumers pick it up, execute the task, and write to the task execution table.
- If executing the task fails, the consumer retries up to 10 times and then publishes the message to a DLQ.
- Messages are redriven from the DLQ and retried.
- After execution, the consumer publishes the task back to the SQS queue with a newer execution time.
- New execution time = current execution time + frequency.
- The consumer also publishes a message to a separate notification Kafka topic.
- Notification consumers consume the message and send the notification to the user identified by the user id.
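The rule "new execution time = current execution time + frequency" can be sketched as below. The function names are hypothetical, and the handling of month ends (clamping to the last valid day of the target month) is an assumption, since the design above does not say how a task scheduled on the 31st should behave in a shorter month.

```python
import calendar
from datetime import datetime, timedelta


def add_months(moment: datetime, months: int) -> datetime:
    """Shift a datetime by whole months, clamping the day to the month's length."""
    month_index = moment.month - 1 + months
    year = moment.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return moment.replace(year=year, month=month, day=min(moment.day, last_day))


def next_execution_time(current: datetime, frequency: str) -> datetime:
    """Return the execution time that follows `current` for the given frequency."""
    if frequency == "weekly":
        return current + timedelta(weeks=1)
    if frequency == "biweekly":
        return current + timedelta(weeks=2)
    if frequency == "monthly":
        return add_months(current, 1)
    if frequency == "yearly":
        return add_months(current, 12)
    raise ValueError(f"unsupported frequency: {frequency}")


print(next_execution_time(datetime(2024, 1, 31, 9, 0), "monthly"))
```

Because of the clamping, a monthly task that starts on January 31 runs on February 29 and then on March 29 if each time is derived from the previous one. Deriving every occurrence from the original start time instead would avoid that drift.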
Trade offs/Tech choices
Explain any trade offs you have made and why you made certain tech choices...
- Chose a SQL DB to store the data. Atomicity is important here: the same task must not be executed multiple times, so we need idempotency and a consistent state. Because ACID properties are important, we went with a SQL DB.
- To support millions of tasks, Kafka is a scalable and reliable option.
- Messages are expected to be delivered at least once, so consumers must handle duplicate deliveries idempotently.
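One possible way to combine at-least-once delivery with the requirement that a task is not executed twice is to let the database arbitrate: a consumer first claims the (task_id, execution_time) pair with a unique key and runs the task only if the claim succeeded. The sketch below uses sqlite3 for illustration; the reduced table and the function names are hypothetical.

```python
import sqlite3


def create_execution_table(conn: sqlite3.Connection) -> None:
    """Minimal execution table; the composite key makes each execution unique."""
    conn.execute(
        "CREATE TABLE task_executions ("
        " task_id TEXT NOT NULL,"
        " execution_time INTEGER NOT NULL,"
        " PRIMARY KEY (task_id, execution_time))"
    )


def claim_execution(conn: sqlite3.Connection, task_id: str, execution_time: int) -> bool:
    """Try to record the execution; return False if it was already recorded."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO task_executions (task_id, execution_time) VALUES (?, ?)",
        (task_id, execution_time),
    )
    conn.commit()
    return cursor.rowcount == 1


def handle_message(conn, task_id, execution_time, run_task) -> bool:
    """Run the task only for the first delivery of a given message."""
    if not claim_execution(conn, task_id, execution_time):
        return False  # duplicate delivery: skip
    run_task()
    return True
```

Claiming before running trades one failure mode for another: a consumer that crashes after the claim but before the task completes leaves a claimed execution that never ran, so a real implementation would likely also track a status or lease on the row.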
Failure scenarios/bottlenecks
Try to discuss as many failure scenarios/bottlenecks as possible.
- Failures while writing to the DB or publishing to the SQS queue are a bottleneck. We will put retry mechanisms in place that use exponential backoff with a backoff factor of 2.
- After all the retries are exhausted, the message will be pushed to a dead letter queue.
- Redrive the messages from the dead letter queue.
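The retry policy described above can be sketched as a small helper. The function name, the base delay, and the use of a plain list as a stand-in for the dead letter queue are all assumptions for illustration; "10 retries" is interpreted here as at most 10 attempts in total.

```python
import time


def run_with_retries(operation, message, dead_letter_queue,
                     max_attempts=10, base_delay=0.1, backoff_factor=2,
                     sleep=time.sleep):
    """Call operation(message) with exponential backoff between attempts.

    Returns True on success. If every attempt fails, the message is
    appended to dead_letter_queue and False is returned.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            operation(message)
            return True
        except Exception:
            if attempt == max_attempts:
                break  # retries exhausted
            sleep(delay)
            delay *= backoff_factor  # 0.1s, 0.2s, 0.4s, ...
    dead_letter_queue.append(message)
    return False
```

The `sleep` parameter is injectable so the delays can be observed in tests without actually waiting. In production, adding random jitter to each delay is a common refinement so that many failing consumers do not retry in lockstep.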
Future improvements
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?