My Solution for Design a Movie Reviews Aggregator System

by nectar4678

System requirements


Functional:

Collect Reviews: The system should fetch movie reviews from various sources, such as IMDb, Rotten Tomatoes, Metacritic, and others.

Aggregate Reviews: Combine reviews for each movie and present them in a unified format, highlighting individual reviews and generating an average rating.

Search and Filter Movies: Users should be able to search movies by title, genre, release year, and other attributes. They should also be able to filter reviews by source, rating, or reviewer.

User Registration and Profile: Allow users to register, create profiles, and save favorite movies for future reference.

User Review Submission: Registered users should be able to submit their own reviews and ratings.

User Ratings and Feedback: Users should be able to upvote or downvote other users' reviews.

Movie Metadata: Display metadata such as movie title, synopsis, release date, genre, and director along with reviews.


Non-Functional:

Scalability: The system should support a growing number of users and handle a large number of concurrent requests. Plan for 10x scaling over the next few years.

High Availability: The system should be available 99.9% of the time.

Performance: Average response times should be less than 200ms under normal loads.

Data Consistency: Reviews aggregated from multiple sources should be consistent with one another and kept up to date.

Security: User data and API keys for external sources should be securely stored and encrypted.

Rate Limiting: Implement rate limiting to prevent misuse or abuse of the system and APIs.



Capacity estimation

User Base: Initial user base is approximately 1 million users, growing to 10 million over the next few years.


Request Traffic:

  • Monthly Requests: 3 million requests (3 requests/user/month)
  • Daily Requests: 100,000 requests/day
  • Peak Requests: 5x average traffic (500,000 requests/day)


Data Sources: We will aggregate reviews from 10 different sources, and each movie has about 20 reviews on average.


Movies in Database: Approximately 50,000 movies in our database, each having metadata and multiple reviews.


Storage Estimation:

  • Initial storage requirement: 2.025 GB for movies and reviews.
  • Projected growth: 10x over the next few years → roughly 20 GB of storage required.


Network Bandwidth Estimation:

Assuming the average response size is 50 KB:

  • Daily Data Transfer: 100,000 requests/day × 50 KB = 5 GB/day.
  • Monthly Data Transfer: 5 GB/day × 30 = 150 GB/month.
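
The traffic and bandwidth figures above can be reproduced with a few lines of arithmetic. This is only a sketch: the 30-day month and decimal units (1 GB = 1,000,000 KB) are assumptions chosen to match the numbers in this section, and the per-second rates are derived here rather than stated in the estimate.

```python
# Back-of-the-envelope check of the request and bandwidth estimates.
USERS = 1_000_000
REQUESTS_PER_USER_PER_MONTH = 3
DAYS_PER_MONTH = 30          # assumption
PEAK_FACTOR = 5
AVG_RESPONSE_KB = 50
SECONDS_PER_DAY = 86_400

monthly_requests = USERS * REQUESTS_PER_USER_PER_MONTH   # 3,000,000
daily_requests = monthly_requests // DAYS_PER_MONTH       # 100,000
peak_daily_requests = daily_requests * PEAK_FACTOR        # 500,000

# Derived average rates, useful for sizing servers.
avg_rps = daily_requests / SECONDS_PER_DAY
peak_rps = peak_daily_requests / SECONDS_PER_DAY

# Decimal units: 1 GB = 1,000,000 KB (assumption).
daily_transfer_gb = daily_requests * AVG_RESPONSE_KB / 1_000_000
monthly_transfer_gb = daily_transfer_gb * DAYS_PER_MONTH

print(f"average: {avg_rps:.2f} req/s, peak: {peak_rps:.2f} req/s")
print(f"transfer: {daily_transfer_gb} GB/day, {monthly_transfer_gb} GB/month")
```

Even at the 5x peak the load is only a handful of requests per second, so the 10x growth target matters more for sizing than the initial traffic.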




API design

We will use the following APIs:


User APIs

These include user registration, user login, and user profile management.


Movie APIs

Search Movies

Endpoint: /api/v1/movies/search
Method: GET
Query Params: title, genre, year
Response:
    [
      { "movie_id": "1", "title": "Inception", "year": "2010", "genre": "Sci-Fi", "rating": 4.8 },
      { "movie_id": "2", "title": "The Matrix", "year": "1999", "genre": "Action", "rating": 4.7 }
    ]


Get Movie Details

Endpoint: /api/v1/movies/{movie_id}
Method: GET
Response:
    {
      "movie_id": "1",
      "title": "Inception",
      "year": "2010",
      "genre": "Sci-Fi",
      "rating": 4.8,
      "reviews": [
        { "review_id": "rev123", "reviewer": "Rotten Tomatoes", "rating": 5, "content": "A visually stunning masterpiece..." },
        { "review_id": "rev124", "reviewer": "IMDb", "rating": 4.5, "content": "Christopher Nolan does it again!" }
      ]
    }


Submit Review

Endpoint: /api/v1/movies/{movie_id}/reviews
Method: POST
Request Body:
    { "user_id": "12345", "rating": 4.5, "content": "Great movie with complex plot." }
Response:
    { "review_id": "rev125", "message": "Review submitted successfully." }
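
As an illustration of how the Submit Review request body might be validated before it is stored, here is a minimal sketch. The `ReviewSubmission` type, the `parse_review_submission` helper, and the 0 to 5 rating range are assumptions made for this example; the rating scale is only inferred from the sample responses above.

```python
from dataclasses import dataclass

# Assumption: the sample ratings (4.5, 4.8, 5) suggest a 0-5 scale.
MIN_RATING = 0.0
MAX_RATING = 5.0


@dataclass(frozen=True)
class ReviewSubmission:
    user_id: str
    rating: float
    content: str


def parse_review_submission(body: dict) -> ReviewSubmission:
    """Validate a Submit Review request body and return a typed object."""
    missing = [key for key in ("user_id", "rating", "content") if key not in body]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")

    rating = body["rating"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(rating, bool) or not isinstance(rating, (int, float)):
        raise ValueError("rating must be a number")
    if not MIN_RATING <= rating <= MAX_RATING:
        raise ValueError(f"rating must be between {MIN_RATING} and {MAX_RATING}")

    content = str(body["content"]).strip()
    if not content:
        raise ValueError("content must not be empty")

    return ReviewSubmission(str(body["user_id"]), float(rating), content)


review = parse_review_submission(
    {"user_id": "12345", "rating": 4.5, "content": "Great movie with complex plot."}
)
print(review)
```

In a real deployment the `user_id` would more likely come from the authenticated session than from the request body, which would keep one user from posting on behalf of another.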




Database design

The Movie Reviews Aggregator System requires a structured database schema to handle user information, movie metadata, reviews, and interactions with external sources. The design should consider scalability, data integrity, and efficient querying for optimal performance.

Tables and Their Purpose

  1. Users: Store user information, including registration details, preferences, and user-generated content.
  2. Movies: Hold basic movie metadata like title, release year, genre, and overall rating.
  3. Reviews: Contain reviews sourced from different platforms and user-submitted reviews, along with ratings and content.
  4. Review Sources: Maintain a record of external review sources like IMDb, Rotten Tomatoes, etc.
  5. Movie-Review Source Mapping: Create associations between movies and review sources to enable tracking of reviews.
  6. User Favorites: Track the movies that users have marked as favorites.
  7. User Feedback: Store user votes (upvotes or downvotes) on reviews submitted by other users.
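
One possible relational layout for the seven tables is sketched below using Python's built-in `sqlite3` module. All column names and constraints are assumptions made for illustration. For simplicity the sketch keeps external and user-submitted reviews in a single `reviews` table, which is only one of several reasonable layouts.

```python
import sqlite3

# Hypothetical schema: table names follow the list above, columns are assumed.
SCHEMA = """
CREATE TABLE users (
    user_id       INTEGER PRIMARY KEY,
    username      TEXT NOT NULL UNIQUE,
    email         TEXT NOT NULL UNIQUE,
    password_hash TEXT NOT NULL
);
CREATE TABLE movies (
    movie_id       INTEGER PRIMARY KEY,
    title          TEXT NOT NULL,
    release_year   INTEGER,
    genre          TEXT,
    overall_rating REAL
);
CREATE TABLE review_sources (
    source_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL UNIQUE
);
CREATE TABLE reviews (
    review_id INTEGER PRIMARY KEY,
    movie_id  INTEGER NOT NULL REFERENCES movies(movie_id),
    source_id INTEGER REFERENCES review_sources(source_id), -- NULL for user reviews
    user_id   INTEGER REFERENCES users(user_id),            -- NULL for external reviews
    rating    REAL NOT NULL,
    content   TEXT
);
CREATE TABLE movie_review_sources (
    movie_id  INTEGER NOT NULL REFERENCES movies(movie_id),
    source_id INTEGER NOT NULL REFERENCES review_sources(source_id),
    PRIMARY KEY (movie_id, source_id)
);
CREATE TABLE user_favorites (
    user_id  INTEGER NOT NULL REFERENCES users(user_id),
    movie_id INTEGER NOT NULL REFERENCES movies(movie_id),
    PRIMARY KEY (user_id, movie_id)
);
CREATE TABLE user_feedback (
    user_id   INTEGER NOT NULL REFERENCES users(user_id),
    review_id INTEGER NOT NULL REFERENCES reviews(review_id),
    vote      INTEGER NOT NULL CHECK (vote IN (-1, 1)),  -- -1 downvote, 1 upvote
    PRIMARY KEY (user_id, review_id)
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create all tables on the given connection."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
```

The composite primary keys on `user_favorites` and `user_feedback` mean a user can favorite a movie or vote on a review at most once.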





High-level design

Components

  1. API Gateway: Acts as the entry point for all client requests, routing them to the appropriate service. Handles authentication, rate limiting, and API versioning.
  2. User Service: Manages user profiles, authentication, and preferences. Handles user registration, login, and profile updates.
  3. Movie Service: Manages movie metadata such as titles, genres, release dates, and synopsis. Also responsible for handling movie searches and providing detailed information.
  4. Review Aggregator Service: Periodically fetches reviews from external sources through APIs, normalizes the data, and stores it in the database.
  5. Review Management Service: Manages user-submitted reviews, allowing users to submit, update, or delete their own reviews. Also manages interactions such as upvotes and downvotes on reviews.
  6. Database: Stores structured data for movies, reviews, and user profiles. Includes relational and non-relational databases for flexibility and performance.
  7. Notification Service: Sends alerts and notifications to users based on their preferences (e.g., new reviews for their favorite movies).
  8. Admin Service: Manages admin-level functionalities, such as adding new movies, moderating user content, and monitoring system health.
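
The rate limiting handled by the API Gateway could be implemented in several ways. A token bucket is one common choice, sketched here as an assumption rather than as the design's stated algorithm. The `TokenBucket` class and its parameters are hypothetical. A production gateway would keep one bucket per client or API key, typically in a shared store.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at a steady rate."""

    def __init__(self, capacity: int, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock              # injectable so the limiter can be tested
        self._tokens = float(capacity)   # start with a full bucket
        self._last = clock()

    def allow(self) -> bool:
        """Return True and consume a token if the request may proceed."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Add the tokens earned since the last call, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


bucket = TokenBucket(capacity=10, refill_per_second=5)
print(bucket.allow())
```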



Request flows

Data Flow

  1. User Requests:
    • Users interact with the system through the API Gateway.
    • For authentication-related actions, requests are routed to the User Service.
    • For movie search or detailed information, requests are routed to the Movie Service.
    • For review-related actions, requests are routed to the Review Management Service.
  2. Review Aggregation:
    • The Review Aggregator Service periodically fetches reviews from external sources.
    • The service normalizes the data and stores it in the Review Database.
    • Data is made available through the Movie Service for users querying movie details.
  3. User Review Submission:
    • Users can submit reviews through the Review Management Service.
    • The service stores the reviews in the Review Database and updates the overall rating for the movie.
  4. Notifications:
    • The Notification Service sends alerts when new reviews are added to a user's favorite movie.
    • Notifications are also sent when new movies are released that match a user's preferred genres.


Review Submission and Aggregation

  1. A user submits a review for a specific movie through the API Gateway.
  2. The request is routed to the Review Management Service, which stores the review data in the Review Database.
  3. After storing the review, the Review Management Service communicates with the Movie Service to update the movie's overall rating.
  4. The Movie Service recalculates the average rating and updates the Movie Database.
  5. A confirmation message is sent back to the user through the API Gateway.
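
Steps 3 and 4 do not require rescanning every review. If the Movie Service stores the current average together with the review count, the new average can be computed incrementally. A minimal sketch follows, where `updated_average` is a hypothetical helper and the stored `(average, count)` pair is an assumption about what the Movie Database keeps:

```python
from typing import Tuple


def updated_average(current_avg: float, review_count: int, new_rating: float) -> Tuple[float, int]:
    """Fold one new rating into a stored running average.

    Returns the new average and the new review count.
    """
    new_count = review_count + 1
    new_avg = current_avg + (new_rating - current_avg) / new_count
    return new_avg, new_count


# A movie with one 4.0 review receives a new 5.0 review.
print(updated_average(4.0, 1, 5.0))
```

Updating or deleting a review needs a slightly different adjustment, and the read-modify-write should happen inside a transaction so that concurrent submissions do not overwrite each other.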


Detailed component design

Review Aggregator Service

Fetching Reviews:

  • Periodically pulls review data from external APIs.
  • Each source has a specific mapper to convert data into a standardized format.

Data Normalization and Deduplication:

  • Reviews from different sources are transformed into a common structure.
  • Duplicate reviews are detected and filtered using content hashes and timestamps.

Review Storage and Rating Update:

  • After normalization, reviews are stored in the Review Database.
  • Triggers updates to movie ratings in the Movie Database to reflect the latest review data.

Error Handling and Retries:

  • Implements retry logic for failed API calls and sends alerts after exceeding retry limits.
  • Logs errors and skips sources that are unresponsive to avoid blocking the process.
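
The retry behavior described above might look like the following sketch. The function name, the exponential backoff schedule, and the `on_give_up` alert hook are assumptions made for illustration. The source only states that failed calls are retried, that alerts fire after the retry limit, and that unresponsive sources are skipped.

```python
import time


def fetch_with_retries(fetch, max_attempts=3, base_delay=1.0,
                       sleep=time.sleep, on_give_up=None):
    """Call `fetch()` until it succeeds or the attempts run out.

    Returns the fetched value, or None when every attempt failed so the
    caller can skip this source and move on to the next one.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as exc:  # real code would catch only network errors
            last_error = exc
            if attempt < max_attempts:
                # Exponential backoff: base_delay, then 2x, 4x, ...
                sleep(base_delay * 2 ** (attempt - 1))
    if on_give_up is not None:
        on_give_up(last_error)    # e.g. raise an alert for the on-call team
    return None
```

Returning `None` instead of raising keeps one failing source from blocking the rest of the aggregation run.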

Scalability:

  • Designed as a stateless service that can be horizontally scaled.
  • Uses batch processing and parallel API calls to handle large datasets efficiently.


FOR each movie IN movie_list:
    FOR each source IN review_sources:
        reviews = fetch_reviews(movie, source)
        normalized_reviews = normalize_reviews(reviews)
        deduplicated_reviews = deduplicate_reviews(normalized_reviews)
        store_reviews(deduplicated_reviews)
    update_movie_rating(movie)
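
A runnable Python sketch of the normalization and deduplication steps in the loop above is shown below. It makes several simplifying assumptions: every source returns raw reviews with the hypothetical fields `score`, `max_score`, and `text` (the design calls for a mapper per source), ratings are rescaled to a 0 to 5 range, and duplicates are detected by content hash only, without the timestamp comparison mentioned earlier.

```python
import hashlib
from typing import Callable, Dict, List


def content_hash(text: str) -> str:
    """Hash review text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def normalize_review(raw: dict, source: str) -> dict:
    """Convert a raw review into the common structure (assumed field names)."""
    return {
        "source": source,
        "rating": 5.0 * raw["score"] / raw["max_score"],  # rescale to 0-5
        "content": raw["text"].strip(),
    }


def deduplicate_reviews(reviews: List[dict]) -> List[dict]:
    """Keep the first review seen for each distinct content hash."""
    seen = set()
    unique = []
    for review in reviews:
        key = content_hash(review["content"])
        if key not in seen:
            seen.add(key)
            unique.append(review)
    return unique


def aggregate_movie(movie_id: str, fetchers: Dict[str, Callable[[str], List[dict]]]) -> List[dict]:
    """Fetch from every source, normalize, then deduplicate across sources."""
    collected = []
    for source, fetch in fetchers.items():
        collected.extend(normalize_review(raw, source) for raw in fetch(movie_id))
    return deduplicate_reviews(collected)


# Stub fetchers standing in for real external API clients.
fetchers = {
    "IMDb": lambda movie_id: [{"score": 9, "max_score": 10, "text": "Great  film"}],
    "Metacritic": lambda movie_id: [
        {"score": 90, "max_score": 100, "text": "great film"},
        {"score": 50, "max_score": 100, "text": "Average"},
    ],
}
reviews = aggregate_movie("1", fetchers)
print(len(reviews))
```

Storing the reviews and updating the movie rating are left out because they depend on the chosen database.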






Trade offs/Tech choices

  1. Fetching reviews from external sources like IMDb, Rotten Tomatoes, and Metacritic in real time would require continuous polling or webhook-based integration. Since external reviews are updated infrequently, a batch processing model (e.g., every 24 hours) keeps data reasonably fresh while reducing API call overhead.
  2. Calculating the average rating for each movie on every request can become expensive, especially as the number of reviews grows. Instead, pre-aggregating ratings in the database whenever new reviews are added allows for quick response times and reduces query complexity.
  3. External reviews have a different structure and querying requirements compared to user-submitted reviews. Using a NoSQL database like MongoDB for external reviews allows for flexible schema changes as different sources may provide varying data fields. Internal user reviews are better suited to a SQL database for structured queries, aggregation, and transaction support.
  4. Making synchronous requests to external review sources can introduce latency and affect the system’s availability if any external API is slow or unresponsive. Using message queues (e.g., RabbitMQ) to process review fetching and storing asynchronously ensures that the main application flow is not blocked.
  5. Full review text is initially stored for flexibility in providing detailed insights to users. Later, NLP-based summarization can be used to create concise reviews, especially for long-form reviews that may not be suitable for display in their entirety.
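
To illustrate the asynchronous processing in point 4, the sketch below uses Python's in-process `queue.Queue` and a worker thread as a stand-in for a real message broker such as RabbitMQ. The job format, the `start_fetch_worker` helper, and the `None` shutdown signal are assumptions made for this example.

```python
import queue
import threading


def start_fetch_worker(jobs, handle, on_error=lambda job, exc: None):
    """Start a background thread that processes fetch jobs from `jobs`.

    Putting None on the queue tells the worker to stop.
    """
    def run():
        while True:
            job = jobs.get()
            try:
                if job is None:
                    return
                handle(job)
            except Exception as exc:
                # Keep the worker alive; a real system would log or requeue.
                on_error(job, exc)
            finally:
                jobs.task_done()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker


jobs = queue.Queue()
processed = []
worker = start_fetch_worker(jobs, processed.append)

# The main flow only enqueues work and returns immediately.
jobs.put(("1", "IMDb"))
jobs.put(("2", "Metacritic"))
jobs.put(None)   # shutdown signal
jobs.join()      # wait here only for the sake of the demonstration
print(processed)
```

With a real broker the producer and the consumer would run in separate processes, and unprocessed jobs would survive a restart of either side.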




Failure scenarios/bottlenecks

External Review Sources Unavailability

  • Scenario: One or more external review sources (e.g., IMDb, Rotten Tomatoes) become unavailable or respond slowly during the aggregation process.
  • Impact: The system cannot fetch the latest reviews, leading to outdated data being displayed to users.


Database Overload During Peak Traffic

  • Scenario: High traffic spikes (e.g., during movie releases or award events) lead to a surge in requests, overwhelming the primary database.
  • Impact: Increased response times, potential timeouts, or complete unavailability of the service.


Failure in Review Aggregator Service

  • Scenario: The Review Aggregator Service crashes or becomes unresponsive due to unhandled exceptions or resource exhaustion.
  • Impact: Review updates from external sources are halted, leading to stale review data.


Inconsistent Data Between Microservices

  • Scenario: Inconsistent state between microservices (e.g., Movie Service and Review Management Service) due to eventual consistency issues or failed updates.
  • Impact: Data discrepancies such as mismatched review counts or outdated movie ratings.





Future improvements

  1. Implement webhook-based integrations or push notifications from external sources (e.g., IMDb, Rotten Tomatoes) to receive real-time updates as soon as new reviews are published.
  2. Develop a recommendation engine that uses collaborative filtering or content-based filtering to suggest movies based on a user’s past interactions (e.g., reviews written, movies favorited) and preferences.
  3. Implement an AI-based moderation system using Natural Language Processing (NLP) techniques to automatically flag reviews containing offensive language, spam, or low-quality content. Alternatively, implement a community-based moderation system where users can flag inappropriate content.
  4. Implement multi-tenancy support to allow businesses to customize the review aggregation system for their use cases. This includes custom branding, separate data stores, and access controls.