System requirements
Functional:
- Task creation - create/delete
- Task execution - start/stop
- Task organization - a task can be executed after the completion of a previous task
- Task prioritization - critical tasks can be executed ahead of less critical tasks
- Handling recurring tasks
- User notification - completion, failure, retry, etc.
- Configurable latency - users can define an acceptable latency for task execution.
Non-Functional:
- Fault Tolerance. Implement mechanisms for seamless failover to handle master or worker node failures without service disruption
- Scalability. The system should be able to handle thousands of tasks executing concurrently.
- Reliability. The system should be highly reliable so that tasks are executed as expected.
Capacity estimation
- Task creation 10/s
- Task execution 10000/s
- Concurrent tasks: assuming each task takes 5 minutes (300 s), 10,000 * 300 = 3 million tasks run concurrently
- The average size of the data each created task produces is 1 KB, so 10 KB of data is created per second and 10 * 60 * 60 * 24 = 864,000 KB = 864 MB per day
- If we store data for 2 years, that is 864 MB * 365 * 2 = 630,720 MB, or about 631 GB
- Bandwidth usage: 10 KB * 8 * 3 = 240 Kbps
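The estimates above can be reproduced with a few lines of arithmetic. This is only a sketch that restates the numbers already given (10 creations/s, 10,000 executions/s, 5-minute tasks, 1 KB per task, 2-year retention); the constant names are illustrative, and decimal units (1 MB = 1000 KB) are assumed to match the figures in the list.

```python
# Back-of-the-envelope capacity figures, using the rates stated above.
CREATIONS_PER_SEC = 10        # task creations per second
EXECUTIONS_PER_SEC = 10_000   # task executions per second
TASK_DURATION_SEC = 5 * 60    # assumed average task duration
RECORD_SIZE_KB = 1            # average data written per created task
RETENTION_YEARS = 2

# Tasks in flight = arrival rate * time each task stays in the system.
concurrent_tasks = EXECUTIONS_PER_SEC * TASK_DURATION_SEC

# Storage, in decimal units (1 MB = 1000 KB, 1 GB = 1000 MB).
write_kb_per_sec = CREATIONS_PER_SEC * RECORD_SIZE_KB
storage_mb_per_day = write_kb_per_sec * 60 * 60 * 24 / 1000
storage_gb_total = storage_mb_per_day * 365 * RETENTION_YEARS / 1000

print(f"concurrent tasks: {concurrent_tasks:,}")
print(f"storage per day:  {storage_mb_per_day:.0f} MB")
print(f"storage, 2 years: {storage_gb_total:.0f} GB")
```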
API design
Login
POST /api/v1/user/login
Params userName/email, password
Return status, accessToken, refreshToken
Task creation
POST /api/v1/task/create
Params accessToken, taskName, taskDesc, maxRetries, userId, priority, associations (upstream or downstream tasks of the current task), payload, code to be executed (submitted through the page; could be a SQL query or a code snippet)
Task execution
POST /api/v1/task/execute
Params accessToken, taskId, dateTime
Task cancelation
POST /api/v1/task/cancel
Params accessToken, taskId
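As an illustration of the task creation call, the sketch below builds a request body from the parameters listed above. The field names come from the API description; the field types, the default values, the shape of associations (two lists of task IDs), and the validate helper are assumptions made for the example, not part of the design.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class CreateTaskRequest:
    # Field names follow the API description; types and defaults are assumed.
    accessToken: str
    taskName: str
    userId: str
    priority: int = 0
    maxRetries: int = 3
    taskDesc: str = ""
    associations: dict = field(
        default_factory=lambda: {"upstream": [], "downstream": []}
    )
    payload: dict = field(default_factory=dict)
    code: str = ""  # SQL query or code snippet to execute

    def validate(self) -> None:
        """Hypothetical client-side checks before sending the request."""
        if not self.taskName:
            raise ValueError("taskName is required")
        if self.maxRetries < 0:
            raise ValueError("maxRetries must be non-negative")


request = CreateTaskRequest(
    accessToken="<token>",
    taskName="nightly-report",
    userId="u-1",
    priority=5,
    associations={"upstream": ["t-100"], "downstream": []},
    code="SELECT count(*) FROM orders",
)
request.validate()
body = json.dumps(asdict(request))  # JSON body for POST /api/v1/task/create
```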
Database design
Task metadata table
taskId, taskName, payload, maxRetries, priority, associations, taskDesc, codeSnippet, createTime.
Task execution table
taskId, startTime, stopTime, duration, status, errorMessage
A relational database is a good fit for managing task dependencies, and it makes optimistic locking straightforward (for example, a conditional update on a version column) to prevent race conditions, so we choose a relational database for this design.
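A minimal sketch of optimistic locking on the task execution table, using SQLite only so the example runs standalone. The version column and the try_transition helper are assumptions added for illustration; they are not columns or functions named in the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_execution ("
    " taskId TEXT PRIMARY KEY,"
    " status TEXT NOT NULL,"
    " version INTEGER NOT NULL)"  # assumed column used for optimistic locking
)
conn.execute("INSERT INTO task_execution VALUES ('t-1', 'PENDING', 0)")
conn.commit()


def try_transition(conn, task_id, expected_version, new_status):
    """Update the row only if nobody changed it since we read `expected_version`."""
    cur = conn.execute(
        "UPDATE task_execution SET status = ?, version = version + 1"
        " WHERE taskId = ? AND version = ?",
        (new_status, task_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows updated means another writer won the race


# Two workers both read version 0, then both try to claim the task.
first = try_transition(conn, "t-1", 0, "EXECUTING")
second = try_transition(conn, "t-1", 0, "EXECUTING")
print(first, second)
```

The losing worker sees that no row was updated and can re-read the task or skip it, so the task is claimed by one worker without holding a lock across the read.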
To further speed up processing, keep latency low, and reduce pressure on the database, we can cache the data in Redis or Memcached.
We also introduce Elasticsearch to store log data and visualize it through Kibana, so we can easily inspect the logs and track down issues.
High-level design
Show as diagram
Auth Service - Handles token generation
Scheduling Service - Receives and processes task creation and execution requests, persists the metadata into the database, and publishes tasks to the message queue
Execution Service - Consumes tasks from the message queue and processes them based on priorities and dependencies
Notification Service - Notifies users, with exception logs, if a task fails
Canal - Manages real-time synchronization between the database and the cache
Monitor Service - Gathers system and performance metrics and reports them to Grafana
Grafana - Visualizes system metrics
Elasticsearch - Stores log info
Kibana - Visualize log info
Request flows
Authentication flow
User sends a request -> rate limiter checks the limits -> request is forwarded to the Auth Service if it passes -> a token is generated and returned
Task creation flow
User creates a task -> rate limiter checks the limits -> load balancer routes the request to the Scheduling Service if it passes -> if the token is valid, the metadata is persisted into the database
Task execution flow
User starts a task -> rate limiter checks the limits -> load balancer routes the request to the Scheduling Service if it passes -> if the token is valid, the task is pushed to the queue and its status is updated to executing
Execution Service takes the task and puts it into an internal priority queue -> polls the highest-priority task and checks its dependencies -> if it depends on another task, checks the status of the upstream task -> starts processing if the upstream task has finished; otherwise puts this task on hold and processes another one.
Detailed component design
1.Handle 3 million tasks running concurrently
All backend services are highly scalable and deployed as multiple instances so they can handle a large number of tasks.
The Scheduling Service and the Execution Service fetch task info directly from the cache to keep latency low and avoid overwhelming the database.
The cache is also deployed in distributed mode, for example as a Redis Cluster, to handle millions of requests.
We've adopted a relational database in this design; if database performance becomes the bottleneck, we can consider sharding the database by task ID.
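Sharding by task ID needs a stable mapping from a task ID to a shard. The sketch below shows one simple option, hash-then-modulo; the function name shard_for, the shard count, and the ID format are assumptions for illustration. A cryptographic hash is used only because it is stable across processes, unlike Python's built-in hash() for strings.

```python
import hashlib
from collections import Counter


def shard_for(task_id: str, num_shards: int) -> int:
    """Map a task ID to a shard index in [0, num_shards)."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards


# Every service instance computes the same shard for the same task ID.
NUM_SHARDS = 4  # illustrative value
counts = Counter(shard_for(f"task-{i}", NUM_SHARDS) for i in range(1000))
print(sorted(counts.items()))
```

One known limitation of plain modulo sharding is that changing the shard count remaps most keys; consistent hashing is a common alternative if shards are expected to be added often.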
2.Tasks executed at most once
We should configure the message queue for at-most-once delivery, which means the same task is never executed more than once, regardless of which consumer picks it up.
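One common way to get at-most-once behavior is for the consumer to acknowledge a message before running it, so a crash during execution loses the task instead of redelivering it. The sketch below demonstrates this with a toy in-memory queue; InMemoryQueue and consume_at_most_once are hypothetical names, not the API of any real message broker.

```python
from collections import deque


class InMemoryQueue:
    """Toy queue: messages that were received but never acked can be redelivered."""

    def __init__(self, messages):
        self._ready = deque(messages)
        self._unacked = []

    def receive(self):
        if not self._ready:
            return None
        msg = self._ready.popleft()
        self._unacked.append(msg)
        return msg

    def ack(self, msg):
        if msg in self._unacked:
            self._unacked.remove(msg)

    def requeue_unacked(self):
        """Simulate the broker redelivering everything that was not acked."""
        self._ready.extend(self._unacked)
        self._unacked.clear()


def consume_at_most_once(queue, handler):
    executed = []
    while (msg := queue.receive()) is not None:
        queue.ack(msg)  # acknowledge BEFORE running: no redelivery after this point
        try:
            handler(msg)
            executed.append(msg)
        except Exception:
            pass  # the task is lost rather than executed a second time
    return executed


def handler(msg):
    if msg == "t-2":
        raise RuntimeError("worker crashed")


queue = InMemoryQueue(["t-1", "t-2", "t-3"])
executed = consume_at_most_once(queue, handler)
queue.requeue_unacked()     # nothing to redeliver: every message was acked up front
leftover = queue.receive()
print(executed, leftover)
```

The trade-off is visible in the output: the failed task is never run twice, but it is also never retried by the queue itself.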
3.Manage priorities
In one node: once a message is consumed by the Execution Service, it is placed into a priority queue. An available worker then polls messages out and processes them in priority order.
Multiple nodes: each service instance is not aware of the priorities held by the others, so there is a possibility that low-priority tasks are executed before high-priority ones.
We can leverage a Redis ZSet to keep tasks ordered by priority; the Execution Service consults Redis before it starts processing a task.
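The difference between per-node queues and a single shared ordering can be shown with a small simulation. This is a sketch under assumptions: PriorityTaskQueue is a hypothetical in-process class, a larger number means a higher priority, and the shared instance only stands in for the Redis ZSet (where ZADD and ZPOPMAX would play the roles of push and pop).

```python
import heapq
import itertools


class PriorityTaskQueue:
    """Pops the highest-priority task first; FIFO among equal priorities."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker preserves arrival order

    def push(self, task_id, priority):
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), task_id))

    def pop(self):
        _, _, task_id = heapq.heappop(self._heap)
        return task_id

    def __len__(self):
        return len(self._heap)


arrivals = [("low-1", 1), ("high-1", 9), ("low-2", 1), ("high-2", 9)]

# Per-node queues: messages are split round-robin between two nodes.
node_a, node_b = PriorityTaskQueue(), PriorityTaskQueue()
for i, (task_id, priority) in enumerate(arrivals):
    (node_a if i % 2 == 0 else node_b).push(task_id, priority)
local_order = []
while node_a or node_b:
    for node in (node_a, node_b):  # nodes take turns executing
        if node:
            local_order.append(node.pop())

# Shared ordering: both nodes pop from one structure (stand-in for the ZSet).
shared = PriorityTaskQueue()
for task_id, priority in arrivals:
    shared.push(task_id, priority)
shared_order = [shared.pop() for _ in range(len(arrivals))]

print(local_order)
print(shared_order)
```

With per-node queues a low-priority task runs first because its node holds nothing better, while the shared ordering always hands out the globally highest-priority task.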
4.Manage dependencies
Before processing a task, the Execution Service validates its dependencies and checks whether they have finished. If they have not finished yet, the current task is put on hold; otherwise the task is processed.
Multiple Execution Service instances preserve the execution sequence by fetching the dependency status from Redis.
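The hold-and-retry behavior described above can be sketched as a loop that re-queues a task whenever one of its upstream tasks has not finished. The function name, the upstream mapping, and the in-memory done set are assumptions for illustration; in the design the finished status would be read from Redis rather than a local set.

```python
from collections import deque


def run_in_dependency_order(tasks, upstream):
    """Run tasks in arrival order, holding any task whose upstream tasks are unfinished.

    tasks:    task IDs in arrival order
    upstream: mapping of task ID -> set of task IDs that must finish first
    """
    finished = []
    done = set()           # stand-in for the finished status stored in Redis
    pending = deque(tasks)
    stalled = 0            # consecutive tasks put on hold without any progress
    while pending:
        task_id = pending.popleft()
        if upstream.get(task_id, set()) <= done:
            finished.append(task_id)  # all dependencies finished: process it
            done.add(task_id)
            stalled = 0
        else:
            pending.append(task_id)   # on hold; process another task first
            stalled += 1
            if stalled >= len(pending):
                raise RuntimeError(f"unsatisfiable dependencies: {list(pending)}")
    return finished


order = run_in_dependency_order(
    ["report", "aggregate", "extract"],
    {"report": {"aggregate"}, "aggregate": {"extract"}},
)
print(order)
```

The stalled counter is an addition for the sketch: it stops the loop when every remaining task is waiting on something that will never finish, such as a dependency cycle.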
5.Handle recurring tasks
When creating a task, the user can configure a cron expression, and the backend system ensures the task runs at the scheduled times.
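To make the cron-based scheduling concrete, here is a minimal sketch of computing a task's next run time from a five-field cron expression. It is deliberately limited: it supports only `*`, `*/n`, single numbers, and comma lists (no ranges or names, and Sunday must be written as 0), and the function names are assumptions. A production system would more likely rely on an existing cron library or scheduler.

```python
from datetime import datetime, timedelta


def _field_matches(spec, value, lowest):
    """Match one cron field; `lowest` is the field's minimum (0 or 1)."""
    for part in spec.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            if (value - lowest) % int(part[2:]) == 0:
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expr, when):
    minute, hour, dom, month, dow = expr.split()
    cron_dow = (when.weekday() + 1) % 7  # cron: 0 = Sunday; Python: 0 = Monday
    dom_ok = _field_matches(dom, when.day, 1)
    dow_ok = _field_matches(dow, cron_dow, 0)
    # Classic cron rule: if both day fields are restricted, either may match.
    day_ok = (dom_ok or dow_ok) if dom != "*" and dow != "*" else (dom_ok and dow_ok)
    return (
        _field_matches(minute, when.minute, 0)
        and _field_matches(hour, when.hour, 0)
        and _field_matches(month, when.month, 1)
        and day_ok
    )


def next_run(expr, after):
    """First whole minute strictly after `after` that matches `expr`."""
    candidate = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    for _ in range(366 * 24 * 60):  # search at most about one year ahead
        if cron_matches(expr, candidate):
            return candidate
        candidate += timedelta(minutes=1)
    raise ValueError(f"no run time within a year for {expr!r}")


print(next_run("0 2 * * *", datetime(2024, 1, 1, 1, 30)))     # daily at 02:00
print(next_run("*/15 * * * *", datetime(2024, 1, 1, 10, 7)))  # every 15 minutes
```

After each run, the scheduler can call next_run again with the current time to enqueue the following occurrence.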
Trade offs/Tech choices
Consistency or Availability
Making sure a task executes successfully even when some error occurs is more important than strong consistency. It's acceptable for some users to see task updates slightly late, so we prefer availability over consistency.
Failure scenarios/bottlenecks
1.Task execution failure
Due to consumer crashes - Failed tasks are redelivered to the message queue so that other consumers can pick them up and consume them again.
Due to logic errors - Failed tasks are retried a configured number of times; if they still fail, update the status and notify the user.
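The retry behavior for logic errors can be sketched as a bounded loop. The function run_with_retries and the notify callback are hypothetical names, and the sketch assumes the configured retry count means retries after the first attempt, so a task runs at most max_retries + 1 times.

```python
def run_with_retries(task, max_retries, notify):
    """Run `task`, retrying on failure; returns (final_status, attempts)."""
    attempt = 0
    while True:
        attempt += 1
        try:
            task()
            return "SUCCEEDED", attempt
        except Exception as exc:
            if attempt > max_retries:
                # Retries exhausted: report the failure to the user.
                notify(f"task failed after {attempt} attempts: {exc}")
                return "FAILED", attempt
            # Otherwise fall through and try again.


calls = {"n": 0}


def flaky():
    """Fails on the first two calls, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")


def always_fails():
    raise ValueError("bad query")


notifications = []
flaky_result = run_with_retries(flaky, max_retries=3, notify=notifications.append)
failed_result = run_with_retries(always_fails, max_retries=2, notify=notifications.append)
print(flaky_result, failed_result, notifications)
```

A real implementation would usually wait between attempts (for example with exponential backoff) instead of retrying immediately.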
2.Backend service crashes
Every backend service is deployed as more than one instance. If an instance crashes, it is evicted from the cluster temporarily and the other instances take over its work.
3.Cache not available
To avoid the cache becoming a single point of failure (SPOF), we have to make it highly available; deploying it as a Redis Cluster achieves that.
4.Database not available
We can use single-leader replication for the database, so that replicas can take over if the leader fails and also provide high read scalability and performance. If write performance is still the bottleneck, we can consider sharding based on taskId.
Future improvements
1.Users can investigate any potential issues through Kibana
2.We can build a more powerful monitoring service that visualizes system metrics and makes us aware of potential issues in advance