My Solution for Design a Web Analytics Tool with Score: 8/10
by zion_nectar434
System requirements
Functional:
TrackUserActions (clicks, pages visited)
ReadActions
segment audience by location
generate report
Non-Functional:
Availability --> as a customer, I want the dashboard system to be up at all times, and the tracking of actions to not suffer from downtime.
Scalability --> I want to be able to ingest as many user actions as possible.
Latency --> Mainly for when we are reading data to fill the dashboard. <1 sec is good, preferably <200ms.
User Privacy --> We should not mix actions across customers; they must be segregated by the owning user.
Capacity estimation
Given a platform that targets organically growing stores:
Let's say we have: 10,000 customers
For each customer: 1,000 store visitors per day
Amount of writes to the system: 10 actions per store visit
Total writes = 10,000 * 1,000 * 10 = 100,000,000 actions per day
Each action is 1KB
Storage needed = 100GB a day * 2 years (730 days) = 73TB
Let's round it to 100TB of storage needed for 2 years (with some buffer for traffic spikes during the year, e.g. Black Friday week).
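The arithmetic above can be checked with a quick back-of-the-envelope script (using decimal units, 1 GB = 10^6 KB):

```python
# Back-of-the-envelope capacity estimation using the numbers from this section.
CUSTOMERS = 10_000
VISITORS_PER_CUSTOMER_PER_DAY = 1_000
ACTIONS_PER_VISIT = 10
ACTION_SIZE_KB = 1

writes_per_day = CUSTOMERS * VISITORS_PER_CUSTOMER_PER_DAY * ACTIONS_PER_VISIT
storage_per_day_gb = writes_per_day * ACTION_SIZE_KB / 1_000_000  # KB -> GB
storage_two_years_tb = storage_per_day_gb * 730 / 1_000           # GB -> TB

print(writes_per_day)        # -> 100000000
print(storage_per_day_gb)    # -> 100.0
print(storage_two_years_tb)  # -> 73.0
```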
API design
logAction:
Input --> actionId, actionType, timestamp, userId, location
Output --> 201 Created
readAllActions:
Input --> userId, actionType
Output --> list of all actions by type
readAllActionByLocation:
Input --> userId, actionType, location
Output --> list of all actions by type in a given location
generateReport:
Input: userId
Output: downloadable report file
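A minimal in-memory sketch of these four endpoints (the function names, fields, and list-based storage are hypothetical stand-ins; real storage is covered in the database section):

```python
from dataclasses import dataclass

@dataclass
class Action:
    action_id: str
    action_type: str
    timestamp: int
    user_id: str
    location: str

# Hypothetical in-memory store standing in for the real database.
_store: list[Action] = []

def log_action(action: Action) -> int:
    _store.append(action)
    return 201  # Created

def read_all_actions(user_id: str, action_type: str) -> list[Action]:
    # Actions are always filtered by owning user (privacy NFR).
    return [a for a in _store if a.user_id == user_id and a.action_type == action_type]

def read_all_actions_by_location(user_id: str, action_type: str, location: str) -> list[Action]:
    return [a for a in read_all_actions(user_id, action_type) if a.location == location]

def generate_report(user_id: str) -> str:
    # Stand-in for real PDF generation: one line per action.
    lines = [f"{a.timestamp} {a.action_type} {a.location}"
             for a in _store if a.user_id == user_id]
    return "\n".join(lines)
```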
Database design
User: id, storeUrl
Action entity: id, name, type, timestamp, owningUser, location
Relationship: a user owns none, one or many actions. An action is owned by a user.
Access Pattern: given a User --> retrieve all Actions of given type
I would say that since the system is write-heavy, a NoSQL DB would be the right choice to store the data.
A NoSQL DB would also give us the flexibility to add new action types whose structure differs from existing ones.
NoSQL also fits the NFRs well: it can be horizontally scaled, and we don't need strong consistency for this data.
I would pick Cassandra over MongoDB because Cassandra is optimised for write-heavy workloads: its leaderless replication lets every node accept writes, whereas a MongoDB replica set routes all writes through a single primary, so we would have to shard MongoDB to scale the write path.
As for reads, we would be better off with a database that offers richer querying capabilities; an analytical warehouse with SQL support makes sense here.
So the actions would be saved initially into Cassandra, and a data management/transformation tool would asynchronously read that data and insert it into our warehouse. Since the application would need to run complex JOIN logic to fulfil our different use cases (existing and future ones), the data can be precomputed/materialised into whatever shapes are expected. This way we would, initially at least, avoid any caching needs.
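As a sketch, the raw actions table in Cassandra could be partitioned by owning user (keeping each customer's actions segregated, per the privacy NFR) and clustered by time. The keyspace, table, and column names below are assumptions mirroring the Action entity; the CQL is shown as a string, the way it would be submitted through the DataStax Python driver:

```python
# Hypothetical CQL for the raw actions table. Partition key = owning_user
# (per-user segregation), clustering keys = ts + action_id (time-ordered
# reads within a partition, newest first).
CREATE_ACTIONS_TABLE = """
CREATE TABLE IF NOT EXISTS analytics.actions (
    owning_user text,
    ts          timestamp,
    action_id   uuid,
    name        text,
    type        text,
    location    text,
    PRIMARY KEY ((owning_user), ts, action_id)
) WITH CLUSTERING ORDER BY (ts DESC);
"""

# Against a live cluster this would be executed roughly as:
#   from cassandra.cluster import Cluster
#   session = Cluster(["cassandra-host"]).connect()
#   session.execute(CREATE_ACTIONS_TABLE)
```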
High-level design
Request flows
logActions flow:
- Client sends action that gets routed to an available API server instance through a LoadBalancer.
- Request is processed through an actionService
- Data is stored in Cassandra
readAllActions flow:
- User using the frontend dashboard requests a set of actions based on a specific type or use case.
- Data is read from the SQL database that hosts the precalculated data, ready for consumption.
readAllActionByLocation flow: same as readAllActions with a given location filter
generateReport:
- Customer requests a report that includes the visualisations of the dashboard.
- ApplicationService queries all visualisations from the SQLDB and generates a PDF file on the fly.
Detailed component design
DataTransferService: this is a third-party tool scheduled to perform async polls that read data from Cassandra into our data warehouse. There is no need for any transformation at this step, just basic data management, e.g. removing unnecessary columns if there are any. By handing off such a simple task to a third-party service we trade a bit of extra cost for less complexity; an alternative would be creating a service that reads data as it comes and queues it into a message broker for the next steps.
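The scheduled polling can be sketched with a simple watermark, so that each run transfers only actions newer than the last run (all names here are hypothetical; `read_since` stands in for a Cassandra query, `write_batch` for a warehouse bulk insert):

```python
from typing import Callable

def transfer_once(read_since: Callable[[int], list[dict]],
                  write_batch: Callable[[list[dict]], None],
                  watermark: int) -> int:
    """One scheduled poll: move actions newer than `watermark` and
    return the new watermark, so re-runs never transfer duplicates."""
    batch = read_since(watermark)
    if batch:
        write_batch(batch)
        # Advance the watermark to the newest transferred action.
        watermark = max(a["timestamp"] for a in batch)
    return watermark
```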
DataTransformationService: I am going to use dbt (Data Build Tool) here, an open-source data transformation tool built for this kind of workload. The setup is a separate repository hosting the queries that consume whatever is in the warehouse and perform the transformations, producing tables of precomputed data in our SQLDB. Saving the data precomputed allows the ApplicationService that serves the frontend to access information without running complex queries in real time, improving read performance.
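The kind of rollup such a transformation would materialise can be illustrated in plain Python (the grouping keys are an assumption based on the access patterns above): counts per (user, action type, location), in exactly the shape the dashboard reads back directly.

```python
from collections import Counter

def materialise_counts(actions: list[dict]) -> dict[tuple, int]:
    """Roll raw actions up into a (user, type, location) -> count table,
    the precomputed shape the dashboard queries would read directly."""
    counts: Counter = Counter()
    for a in actions:
        counts[(a["owning_user"], a["type"], a["location"])] += 1
    return dict(counts)
```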
Trade offs/Tech choices
- I used a third-party service for the DataTransferService to avoid building extra system components. It brings cost up but lowers complexity. We can accept this trade-off since the step does not hold much logic; it's just moving data around.
- Precalculated data in a dedicated read-only SQLDB for faster reads. We traded some extra system complexity and weaker consistency for faster responses.
- Cassandra's eventual consistency model is not an issue; there is no problem if users see slightly stale data in the frontend dashboard.
Failure scenarios/bottlenecks
A potential failure scenario is the read-only SQLDB holding the precomputed data: what happens if, instead of 10,000 customers, we have 1,000,000?
The same bottleneck could hit Cassandra, where a single datastore might not be enough to write 100x the current number of actions.
Future improvements
- Sharding the databases by a partition key (owningUser) could be a solution to both scenarios.
- generateReport currently builds reports on the fly; to reduce latency, reports could be generated asynchronously instead.
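The sharding idea can be sketched as a deterministic mapping from owningUser to a shard (the shard count and hashing scheme are assumptions): the same customer always lands on the same shard, which works for both the Cassandra write path and the read-only SQLDB.

```python
import hashlib

def shard_for_user(owning_user: str, num_shards: int = 16) -> int:
    """Stable shard assignment: hashing the partition key means the same
    user always maps to the same shard, on every node, with no lookup table."""
    digest = hashlib.sha256(owning_user.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```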