Codemia | Master System Design Interviews Through Active Practice

My Solution for Design a Frequently Viewed Products Feature for an E-commerce Platform

by nectar4678

System requirements

Functional:

Product View Tracking:

The system should track the products that users view and store this data efficiently for further analysis.

Frequently Viewed Together Identification:

The system should analyze user behavior to identify products frequently viewed together, providing insights that can be used to recommend these products to users.

Display Recommendations:

The system should display products frequently viewed together on the product page to facilitate cross-selling.

Efficient Querying:

The system should allow for querying frequently viewed products based on different time windows (e.g., last 24 hours, last week, last month).

Non-Functional:

The system must scale to handle traffic spikes, especially during peak shopping periods such as Black Friday or holiday sales, while the recommendation engine continues to respond with low latency.

Capacity estimation

Assumptions:

Active Users: 1 million active users per month.
Daily Active Users (DAUs): 100,000.
Product Views per Day: 500,000 views (5 views per daily active user, assuming one session per user per day).
Average Data Stored Per View Event: 1 KB (this includes product ID, user ID, timestamp, session ID, etc.).
Recommendation Query Frequency: Every product page will request product recommendations for frequently viewed products. This could happen 500,000 times per day (one request per product view).
Time Window for Frequent Views: The system tracks product pairs for 30 days.

Data Storage:

Daily Data Storage:
- Each product view event takes around 1 KB of storage.
- 500,000 views/day × 1 KB = 500 MB/day.
Monthly Data Storage:
- Assuming the system stores data for 30 days to analyze product views:
- 500 MB/day × 30 = 15 GB/month.
Storage for Frequently Viewed Products:
- For each product pair (e.g., Product A and Product B viewed together), we estimate 100 bytes of storage (including product pair data, frequency count, and timestamps).
- If we assume 10% of the product views (50,000) generate frequently viewed pairs, the total storage required for product pairs:
- 50,000 pairs/day × 100 bytes = 5 MB/day.
- Over 30 days, this would amount to 150 MB of storage for tracking product pairs.
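
The storage figures above can be reproduced with a few lines of arithmetic. This is only a sketch of the estimate, using the assumptions already listed and decimal units (1 MB = 1,000,000 bytes); the constant names are illustrative.

```python
# Back-of-the-envelope storage estimate from the stated assumptions.
VIEWS_PER_DAY = 500_000
BYTES_PER_VIEW = 1_000      # ~1 KB per view event
RETENTION_DAYS = 30
PAIR_PERCENT = 10           # share of views assumed to produce a pair
BYTES_PER_PAIR = 100

daily_view_bytes = VIEWS_PER_DAY * BYTES_PER_VIEW
monthly_view_bytes = daily_view_bytes * RETENTION_DAYS

pairs_per_day = VIEWS_PER_DAY * PAIR_PERCENT // 100
daily_pair_bytes = pairs_per_day * BYTES_PER_PAIR
monthly_pair_bytes = daily_pair_bytes * RETENTION_DAYS

print(f"Views: {daily_view_bytes / 1e6:.0f} MB/day, "
      f"{monthly_view_bytes / 1e9:.0f} GB per {RETENTION_DAYS} days")
print(f"Pairs: {daily_pair_bytes / 1e6:.0f} MB/day, "
      f"{monthly_pair_bytes / 1e6:.0f} MB per {RETENTION_DAYS} days")
```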

System Scalability:

To handle 500,000 requests/day, we assume:
- An average of about 6 requests per second (500,000 requests / 86,400 seconds), with peak traffic of ~10 requests per second (considering bursts).
- If each server can handle 100 requests/second, 1 server instance will suffice during normal operation, but at least 2 server instances (auto-scaling to 3–4 during high traffic) will be needed for redundancy and scaling.
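
As a rough check of the sizing above, the request rate and instance count can be computed directly. The peak rate, per-server capacity, and redundancy floor are the assumptions stated in the text, not measured values.

```python
import math

REQUESTS_PER_DAY = 500_000
SECONDS_PER_DAY = 24 * 60 * 60
PEAK_RPS = 10               # burst estimate from the text
SERVER_CAPACITY_RPS = 100   # assumed per-server throughput
MIN_INSTANCES = 2           # redundancy floor

average_rps = REQUESTS_PER_DAY / SECONDS_PER_DAY
servers_for_peak = math.ceil(PEAK_RPS / SERVER_CAPACITY_RPS)
instances = max(MIN_INSTANCES, servers_for_peak)

print(f"average ~{average_rps:.1f} req/s, instances needed: {instances}")
```

Capacity alone calls for a single instance; the second instance exists purely for redundancy.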
Database considerations:
- A distributed NoSQL database like Cassandra or DynamoDB would be ideal for handling large write-heavy operations (e.g., product views).
- Read replicas can help handle the read-heavy nature of the recommendation queries.

Cache:

A caching layer (e.g., Redis or Memcached) can be used to store frequent recommendation queries to reduce load on the database.
Cache the most frequently requested 20% of queries; under the 80/20 rule this should cover roughly 80% of the traffic.

API design

Get Frequently Viewed Products API

This API retrieves products frequently viewed together with a specific product.

Endpoint:

GET /api/v1/products/{productId}/frequently-viewed

Request Parameters:

productId (string): The ID of the product the user is currently viewing; frequently viewed products are returned for this product.
Optional query parameters:
- timeWindow (string): The time window to filter frequently viewed products (e.g., 24h, 7d, 30d).

Request Example:

GET /api/v1/products/98765/frequently-viewed?timeWindow=7d

Response Example:

{
  "productId": "98765",
  "frequentlyViewed": [
    {
      "productId": "67890",
      "viewCount": 345,
      "lastViewed": "2024-09-14T15:30:00Z"
    },
    {
      "productId": "54321",
      "viewCount": 298,
      "lastViewed": "2024-09-13T12:15:00Z"
    }
  ]
}
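
The timeWindow query parameter has to be turned into a duration before it can filter anything. The sketch below shows one way to do that; the helper name parse_time_window is hypothetical, and the accepted units (h and d) are inferred from the examples 24h, 7d, and 30d, since the design does not define the full grammar.

```python
import re
from datetime import timedelta

# Assumed grammar: a positive integer followed by "h" (hours) or "d" (days).
_WINDOW_PATTERN = re.compile(r"(\d+)([hd])")
_UNIT_TO_KWARG = {"h": "hours", "d": "days"}


def parse_time_window(value: str) -> timedelta:
    """Convert a timeWindow string such as '7d' into a timedelta."""
    match = _WINDOW_PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"unsupported timeWindow: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    if amount == 0:
        raise ValueError("timeWindow must be positive")
    return timedelta(**{_UNIT_TO_KWARG[unit]: amount})
```

Rejecting unknown values early keeps the set of distinct cache keys small, because only well-formed windows reach the cache.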

API Considerations:

Rate Limiting: The APIs should enforce per-client rate limits so that abusive or misbehaving clients cannot exhaust capacity during high traffic, especially on the frequently-viewed endpoint.
Authentication: The APIs must be secured using OAuth or token-based authentication to prevent unauthorized access.
Caching: Responses for frequently viewed products should be cached (e.g., with a TTL of 1-5 minutes) to reduce the load on the system.
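
The design does not say which rate-limiting algorithm to use. A token bucket is one common choice, sketched here as an in-process illustration; the class and parameter names are hypothetical, and a production deployment would more likely enforce limits at the API Gateway or in a shared store.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and refills at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = capacity      # start with a full bucket
        self.updated_at = clock()

    def allow(self) -> bool:
        """Return True and consume one token if the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

One bucket per client (for example, keyed by API token) would let a single caller burst briefly without starving others.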

Database design

Key Entities:

User: Represents users of the e-commerce platform.
Product: Represents products available on the e-commerce platform.
ProductView: Tracks each product view event.
FrequentlyViewedPair: Stores product pairs that are frequently viewed together.
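
A minimal sketch of the two analytics entities as Python dataclasses. The field names are assumptions derived from the view-event contents listed in the capacity estimate (product ID, user ID, timestamp, session ID) and from the API response fields (viewCount, lastViewed); the actual schema is not specified.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ProductView:
    """One immutable product view event."""
    user_id: str
    product_id: str
    session_id: str
    viewed_at: datetime


@dataclass
class FrequentlyViewedPair:
    """Aggregated count for two products viewed together."""
    product_a_id: str
    product_b_id: str
    view_count: int
    last_viewed: datetime
```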

Considerations:

Product Pair Symmetry:
When storing frequently viewed product pairs, we must ensure that the pair (A, B) is treated the same as (B, A) to avoid duplication. The application logic should enforce an ordering rule (e.g., always store the pair where productAId < productBId).
Sharding (NoSQL Option):
For a NoSQL solution (e.g., Cassandra or DynamoDB), the ProductView table could be partitioned by userId to distribute user data across multiple nodes. The FrequentlyViewedPair table could be partitioned by product ID to distribute the load of frequently viewed queries.
Data Expiration:
Data in the ProductView table can be archived or deleted after a set time (e.g., 30 days) to manage storage and keep the system performant.
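
The ordering rule for product pairs can be enforced with a small helper applied before every read or write. This is a sketch with a hypothetical function name; it compares IDs as strings, so the order is lexicographic, which is fine as long as every writer and reader uses the same rule.

```python
from typing import Tuple


def canonical_pair(product_x: str, product_y: str) -> Tuple[str, str]:
    """Return the pair with the smaller ID first, so (A, B) == (B, A)."""
    if product_x == product_y:
        raise ValueError("a product cannot be paired with itself")
    if product_x < product_y:
        return (product_x, product_y)
    return (product_y, product_x)
```

For example, canonical_pair("98765", "67890") and canonical_pair("67890", "98765") both yield ("67890", "98765"), so only one row is ever stored for the pair.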

High-level design

Core Components:

Web Application (E-commerce Platform): The user interface where customers browse products.
Tracking Service: Responsible for logging product views in real-time.
Recommendation Service: Analyzes product view data and identifies frequently viewed product pairs.
Product Database: Stores product details.
Analytics Database: Stores product views and frequently viewed product pairs.
Cache Layer (Redis/Memcached): Caches frequent queries for frequently viewed product pairs.
API Gateway: Acts as an entry point for all API calls, routing requests to appropriate services.
Batch Processing System (Optional): Periodically processes product view data to update frequently viewed products.

Request flows

Frequently Viewed Products Retrieval Flow

This flow explains how the system fetches frequently viewed products for a particular product.

Steps:

User views a product on the e-commerce platform and the Web Application requests frequently viewed products for that product.
The Web Application sends a GET request to the API Gateway.
The API Gateway forwards the request to the Recommendation Service.
The Recommendation Service checks the Cache Layer to see if there are cached frequently viewed products for the requested product.
- If found, the cached result is returned to the Web Application.
- If not found, the Recommendation Service queries the Analytics Database to calculate the frequently viewed products.
The Recommendation Service updates the Cache Layer with the frequently viewed products (if they were fetched from the database).
The frequently viewed products are returned to the Web Application, which displays them to the user.

Detailed component design

Recommendation Service

Role:

The Recommendation Service is responsible for retrieving frequently viewed products based on user behavior. It analyzes product view data and serves recommendations with low latency.

Key Considerations:

Low Latency: The service needs to serve recommendations quickly (within a few milliseconds). To achieve this, it relies heavily on caching frequently viewed product pairs.
Real-time Updates: The service must ensure that recommendations reflect real-time user behavior. This requires constant updates to the frequently viewed product pairs stored in the Analytics Database and Cache Layer.

Design Details:

Cache Layer: A Redis or Memcached cache stores frequently queried product recommendations. The Recommendation Service checks the cache first before querying the database, minimizing read latency.
Real-time Updates to Product Pairs: The Recommendation Service continuously updates the frequently viewed product pairs using either streaming data or batch processing. For example, a background job (or streaming service like Apache Flink) can calculate the most frequently viewed pairs by processing the data stored in the Analytics Database.

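CACHE_TTL_SECONDS = 300  # 5 minutes

def getFrequentlyViewedProducts(productId, timeWindow):
    # cache and analyticsDatabase are placeholder clients for the Cache Layer
    # and the Analytics Database.
    cacheKey = f"frequently_viewed_{productId}_{timeWindow}"
    result = cache.get(cacheKey)

    # Cache hit: return the stored recommendations.
    if result is not None:
        return result

    # Cache miss: query the database, then populate the cache.
    frequentlyViewedPairs = analyticsDatabase.queryFrequentlyViewedPairs(productId, timeWindow)
    cache.set(cacheKey, frequentlyViewedPairs, ttl=CACHE_TTL_SECONDS)
    return frequentlyViewedPairs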
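
The lookup above reads pairs that a background or streaming job has already computed. The design does not define "viewed together"; the sketch below assumes it means "viewed in the same session" and counts each pair once per session. The function names and the (session_id, product_id) input shape are illustrative.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_co_views(view_events):
    """Count product pairs per session from (session_id, product_id) tuples."""
    products_by_session = defaultdict(set)
    for session_id, product_id in view_events:
        products_by_session[session_id].add(product_id)

    pair_counts = Counter()
    for products in products_by_session.values():
        # sorted() makes every pair come out smaller-ID-first.
        for pair in combinations(sorted(products), 2):
            pair_counts[pair] += 1
    return pair_counts


def top_co_viewed(pair_counts, product_id, limit=5):
    """Return up to `limit` (other_product_id, count) tuples, highest first."""
    related = Counter()
    for (a, b), count in pair_counts.items():
        if a == product_id:
            related[b] = count
        elif b == product_id:
            related[a] = count
    return related.most_common(limit)
```

A batch job could run count_co_views over each time window's events and write the results to the pair table; a streaming job would apply the same counting incrementally.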

Scaling Strategy:

Cache Sharding: As the cache size grows, the Redis cache can be sharded by productId. This allows the Recommendation Service to retrieve cached data from multiple cache nodes, improving performance under high load.
Batch Processing: The service can use a batch-processing system (like Apache Spark) to periodically compute frequently viewed products, optimizing the system for heavy traffic periods.
Streaming Option: Alternatively, using a streaming platform like Apache Kafka Streams or Flink, the product view events can be processed in near real-time, allowing frequent updates to product recommendations without significant delays.
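
Sharding the cache by productId can be as simple as hashing the ID to pick a node. This is a client-side sketch with a hypothetical function name; managed setups such as Redis Cluster handle key distribution themselves, and plain modulo hashing remaps most keys when the shard count changes, so consistent hashing may be preferable if nodes are added often.

```python
import hashlib


def shard_for_product(product_id: str, shard_count: int) -> int:
    """Map a product ID to a shard index in [0, shard_count)."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    # A stable hash is used because Python's built-in hash() is randomized
    # per process for strings.
    digest = hashlib.sha256(product_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```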

Trade offs/Tech choices

Since the system will be logging millions of product views each month, a NoSQL database like Cassandra or DynamoDB, which excels at handling high write loads, was selected.
If the business requirement allows slightly delayed updates (e.g., updating product pairs every few hours), batch processing can be highly efficient: it processes large amounts of data at regular intervals and updates the database with the most frequently viewed products.
Kafka can handle large-scale messaging and is ideal for scenarios where we expect a high volume of product view events to be logged continuously.

Failure scenarios/bottlenecks

Traffic spikes on the Tracking Service: Deploy the Tracking Service in a cloud environment with auto-scaling enabled, so that additional instances can be started to handle increased load during sales events.
Cold cache under load: Prewarm the cache with frequently viewed product pairs during low-traffic periods, so that it is ready to handle traffic spikes.
Inconsistent pair data: Periodically run a batch job to reconcile inconsistencies in product pair data, so that view counts and product pairs stay synchronized across nodes in the Analytics Database.

Future improvements

Move from batch processing of frequently viewed products to a real-time processing model using a streaming platform like Apache Flink or Kafka Streams.
Add machine learning models that generate personalized product recommendations for each user, based on their browsing and purchasing behavior.
Implement geo-replicated databases and caches to ensure low-latency access to product view data and recommendations for users in different regions.