My Solution for Design a Web Analytics Tool with Score: 8/10

by zion_nectar434

System requirements


Functional:

TrackUserActions (clicks, page visits)

ReadActions

SegmentAudienceByLocation

GenerateReport



Non-Functional:

Availability --> As a customer, I want the dashboard system to be up at all times, and the tracking of actions to not suffer downtime.


Scalability --> The system should be able to ingest user actions at whatever rate the customer base generates, and keep up as it grows.


Latency --> Mainly for reads that fill the dashboard: <1 s is acceptable, preferably <200 ms.


User Privacy --> We should not mix actions between customers; every action must be segregated under its owning user.




Capacity estimation


Given a platform targeting stores that grow organically:


Let's say we have 10,000 customers.


For each customer: 1,000 store visitors per day.


Writes to the system: 10 actions per store visit.


Total writes = 10,000 × 1,000 × 10 = 100,000,000 actions per day.


Each action is 1 KB.


Storage needed = 100 GB a day * 2 years (730 days) = 73 TB.


Let's round that up to 100 TB of storage needed for 2 years, leaving some buffer for traffic spikes during the year (e.g., Black Friday week).
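The arithmetic above can be double-checked with a quick script (variable names are illustrative):

```python
# Back-of-the-envelope capacity check for the estimates above.
customers = 10_000
visitors_per_customer = 1_000        # store visitors per customer per day
actions_per_visit = 10

writes_per_day = customers * visitors_per_customer * actions_per_visit
writes_per_second = writes_per_day / 86_400   # ~1,157 writes/s on average

action_size_kb = 1
storage_per_day_gb = writes_per_day * action_size_kb / 1_000_000
storage_two_years_tb = storage_per_day_gb * 730 / 1_000
```

The per-second figure (~1.2K writes/s average, more at peak) is what the ingestion path must sustain.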





API design


logAction:

Input --> userId, actionId, actionType, location, timestamp (userId and location are needed so the action can be segregated by owner and filtered by location later)

Output --> 201 Created


readAllActions:

Input --> userId, actionType

Output --> list of all actions by type


readAllActionByLocation:

Input --> userId, actionType, location

Output --> list of all actions by type in a given location


generateReport:

Input: userId

Output: downloadable report file
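A minimal in-memory sketch of these operations, assuming an `Action` record shaped like the entity in the database design (the function and field names here are illustrative, not a fixed contract):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Action:
    action_id: str
    action_type: str      # e.g. "click", "page_visit"
    timestamp: datetime
    user_id: str          # owning customer, keeps data segregated per user
    location: str

ACTIONS: list[Action] = []   # stand-in for the datastore

def log_action(action: Action) -> int:
    ACTIONS.append(action)
    return 201               # "201 Created"

def read_all_actions(user_id: str, action_type: str) -> list[Action]:
    # Always filter by user_id first: actions are never mixed across owners.
    return [a for a in ACTIONS
            if a.user_id == user_id and a.action_type == action_type]

def read_all_actions_by_location(user_id: str, action_type: str,
                                 location: str) -> list[Action]:
    return [a for a in read_all_actions(user_id, action_type)
            if a.location == location]
```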





Database design


User: id, storeUrl


Action entity: id, name, type, timestamp, owningUser, location


Relationship: a user owns none, one or many actions. An action is owned by a user.


Access Pattern: given a User --> retrieve all Actions of given type


Since the system is write-heavy, a NoSQL DB is the right choice for storing the raw data.


A NoSQL DB would also give us the flexibility to add new action types whose structure differs from existing ones.


NoSQL also fits the NFRs well: it scales horizontally, and we don't need strong consistency for this data.


I would pick Cassandra over MongoDB because Cassandra is optimised for write-heavy workloads: its leaderless replication lets every node accept writes, whereas in a MongoDB replica set only the primary accepts writes, so matching Cassandra's write throughput would force us to shard from the start.


As for reads, we are better served by a database with richer querying capabilities. An analytical warehouse with SQL support makes sense here.


So actions are saved initially into Cassandra, and a data management/transformation tool asynchronously reads that data and inserts it into our warehouse. Since the application would otherwise need complex JOIN logic to fulfill our different use cases (existing and future), the data can be precomputed/materialised into whatever shapes the dashboard expects. This way we (initially at least) avoid any caching layer.



High-level design




Request flows


logActions flow:


  1. Client sends an action, which a LoadBalancer routes to an available API server instance.
  2. The request is processed by the actionService.
  3. The data is stored in Cassandra.


readAllActions flow:


  1. A customer using the frontend dashboard requests a set of actions for a specific type or use case.
  2. Data is read from the SQL database that hosts the precalculated, consumption-ready data.


readAllActionByLocation flow: same as readAllActions with a given location filter


generateReport:


  1. Customer requests a report containing the dashboard's visualisations.
  2. The ApplicationService queries all visualisations from the SQLDB and generates a PDF file on the fly.





Detailed component design


DataTransferService: a third-party tool scheduled to poll Cassandra asynchronously and copy data into our Data Warehouse. No transformation is needed at this step, just basic data management such as removing unnecessary columns, if any. By handing this simple task to a third-party service we trade a bit of extra cost for less complexity; the alternative would be building a service that reads data as it arrives and queues it into a message broker for the next steps.


DataTransformationService: I am going to use dbt (Data Build Tool) here, an open-source data transformation tool built for this kind of workload. The setup is a separate repository hosting the queries that consume whatever is in the warehouse and perform the transformations, resulting in tables of precomputed data in our SQLDB. Saving the data precomputed lets the ApplicationService that serves the frontend access information without running complex queries in real time, improving read performance.
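What such a transformation materialises, sketched in Python rather than dbt's SQL (the table shape and field names are illustrative): raw events rolled up into a precomputed table keyed by (user, action type, location), so the dashboard reads one aggregated row instead of scanning raw events.

```python
from collections import Counter

def build_actions_by_location(raw_actions: list[dict]) -> list[dict]:
    # Roll raw events up into one row per (user, action_type, location),
    # the shape the readAllActionsByLocation dashboard view consumes.
    counts = Counter((a["user_id"], a["action_type"], a["location"])
                     for a in raw_actions)
    return [{"user_id": u, "action_type": t, "location": loc, "count": c}
            for (u, t, loc), c in counts.items()]
```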





Trade offs/Tech choices


  • I used a third-party service for the DataTransferService to avoid adding extra system components. It raises cost but lowers complexity; the trade-off is acceptable because this step holds little logic, it's just moving data around.


  • Precalculated data lives in a dedicated read-only SQLDB for faster reads. We traded some extra system complexity and weaker consistency for faster responses.


  • Cassandra's eventual consistency model is not an issue: it's fine if a user sees slightly stale data in the frontend dashboard.







Failure scenarios/bottlenecks


A potential failure scenario is the read-only SQLDB holding the precomputed data becoming a bottleneck if, instead of 10,000 customers, we had 1,000,000.


The same bottleneck could occur in Cassandra, where a single datastore might not be enough to write 100× the current volume of actions.




Future improvements


  • Sharding both databases by a partition key such as owningUser could address both scenarios.


  • The generateReport action currently builds reports on the fly; to reduce latency, reports can be generated asynchronously instead, with the client polling or being notified when the file is ready.
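The async variant can be sketched with a job queue and a background worker: the API accepts the request immediately, and a later download request fetches the finished report (names here are illustrative, and the "report" is a placeholder string standing in for the PDF build):

```python
import queue
import threading

# Async report generation: enqueue the job, return immediately, let a
# background worker build the report for a later download request.
jobs: queue.Queue = queue.Queue()
reports: dict[str, str] = {}   # report_id -> generated content

def request_report(report_id: str, user_id: str) -> dict:
    jobs.put((report_id, user_id))
    # Respond right away instead of blocking on PDF generation
    # (an HTTP API would return 202 Accepted here).
    return {"status": "accepted", "report_id": report_id}

def worker() -> None:
    while True:
        report_id, user_id = jobs.get()
        reports[report_id] = f"report for {user_id}"  # placeholder PDF build
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()
```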