My Solution for Design YouTube or Netflix with Score: 8/10
by iridescent_luminous693
System requirements
1. Functional Requirements (FR)
- User Management
- Sign up, login, logout
- Multiple profiles per account (Netflix)
- Video Uploading / Ingestion
- Users upload videos (YouTube)
- Ingestion pipeline from studios or content providers (Netflix)
- Add metadata: title, description, tags, etc.
- Generate thumbnails
- Video Encoding & Transcoding
- Convert to multiple resolutions and formats
- Generate adaptive bitrate streams (HLS, MPEG-DASH)
- Video Streaming
- Stream videos efficiently based on user bandwidth
- Support resume from last watch position
- Content Discovery
- Search by title, genre, actor, etc.
- Sort and filter results
- Categories like trending, new releases, etc.
- Recommendations
- Personalized video suggestions
- Continue watching list
- Auto-play next episode
- User Interactions (YouTube)
- Like, dislike, comment on videos
- Subscribe to channels
- Create playlists
- Watch History
- Maintain user's playback history
- Show “recently watched” or “watch again”
- Subscriptions & Payments (Netflix)
- Choose and manage subscription plans
- Integration with payment gateways
- Access control based on active subscription
- Admin Features
- Upload/manage platform content
- Ban/report content or users
- Manage payments and user analytics
2. Non-Functional Requirements (NFR)
- Scalability
- Handle millions of concurrent users and videos
- Auto-scale storage and compute resources
- High Availability
- Minimal downtime
- Replication across multiple regions
- Performance
- Fast load and buffer times
- Low-latency streaming
- Security
- Encrypted user data and video content
- Digital Rights Management (DRM) for premium content
- Prevent piracy and unauthorized sharing
- Reliability
- Handle failures gracefully
- Retry mechanisms and redundancy
- Maintainability
- Clean architecture and modular design
- Easy to patch and update services
- Extensibility
- Add support for new content types (e.g., live streaming)
- Easily integrate third-party APIs (e.g., ad services)
- Monitoring & Logging
- Real-time analytics on views, errors, system health
- Log user activity for debugging and personalization
- SEO Optimization (YouTube)
- Optimize metadata and video pages for search engines
- Compliance
- GDPR, COPPA, or other legal regulations depending on geography
Capacity estimation
1. User Base and Traffic
For YouTube-scale:
- Around 2.5 billion monthly active users.
- Roughly 1 billion users use the platform daily.
- Each user might watch 60–90 minutes of video per day.
- Peak concurrent viewers can go up to 10–20 million.
For Netflix-scale:
- Around 250 million monthly active users.
- Roughly 100 million use it daily.
- Each user might stream 90–120 minutes per day.
- Peak concurrency is usually around 5–10 million users.
2. Video Upload and Storage (mostly applies to YouTube)
- Around 500 hours of video are uploaded every minute.
- That totals 720,000 hours of video per day.
- If 1 hour of 1080p video takes about 2.25 GB, the daily storage requirement is roughly 1.6 petabytes.
- Monthly, this comes to about 50 petabytes of new video being added.
3. Transcoding and Video Variants
- Each uploaded video is transcoded into multiple formats and resolutions (e.g., 144p, 240p, 360p, 720p, 1080p, 4K).
- On average, there are 5–10 variants for each video.
- So the total storage consumption due to transcoding increases by a factor of 5 to 10.
- This means monthly storage growth is around 250 to 500 petabytes once variants are included — several exabytes per year.
4. Video Streaming and Bandwidth
For Netflix:
- Assume 100 million users stream for about 2 hours a day.
- At 1080p (around 5 Mbps), the total daily data transfer is close to 450 petabytes.
- At peak time, with 20 million concurrent users at 5 Mbps, you’d need to serve around 100 terabits per second.
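These estimates are straightforward to sanity-check. A back-of-envelope sketch in Python, using only the assumptions stated in this section (2.25 GB per 1080p hour, 5 Mbps streams), reproduces the storage and bandwidth figures:

```python
# Back-of-envelope check of the capacity numbers above.
UPLOAD_HOURS_PER_MIN = 500        # YouTube-scale ingest
GB_PER_1080P_HOUR = 2.25          # assumed size of one hour at 1080p
VARIANT_FACTOR = (5, 10)          # transcoded renditions per video

daily_hours = UPLOAD_HOURS_PER_MIN * 60 * 24              # 720,000 h/day
daily_raw_pb = daily_hours * GB_PER_1080P_HOUR / 1e6      # ~1.6 PB/day
monthly_raw_pb = daily_raw_pb * 30                        # ~49 PB/month
monthly_with_variants = [monthly_raw_pb * f for f in VARIANT_FACTOR]

STREAM_MBPS = 5                   # 1080p bitrate
viewers, hours = 100e6, 2         # Netflix-scale daily viewing
egress_bytes = viewers * hours * 3600 * STREAM_MBPS * 1e6 / 8
peak_tbps = 20e6 * STREAM_MBPS / 1e6                      # 20M concurrent streams

print(f"daily raw ingest: {daily_raw_pb:.1f} PB")
print(f"monthly incl. variants: {monthly_with_variants[0]:.0f}-{monthly_with_variants[1]:.0f} PB")
print(f"daily egress: {egress_bytes / 1e15:.0f} PB, peak {peak_tbps:.0f} Tbps")
```

Running this prints ~1.6 PB/day of raw ingest, ~243–486 PB/month with variants, ~450 PB/day of streaming egress, and a 100 Tbps peak.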
5. Metadata and Comment Storage
- Each video might have about 10 KB of metadata.
- For 1 billion videos, that’s around 10 terabytes of metadata.
- If each video has 100 comments on average, that’s about 100 billion rows in the comments table.
- Likes, views, and playlists scale into trillions of records.
6. Watch History
- Assuming 1 billion users and 1000 watch history entries per user, you'd end up with around 1 trillion rows in your watch history system.
7. Caching and CDN Load
- Most watched videos are cached at the edge (CDN servers).
- Just the top 10,000 videos can serve over 90% of the traffic.
- Each CDN edge location may need around 1 to 2 terabytes of hot content to be cached.
8. Logs and Analytics
- Logging user behavior, video starts/stops, errors, and buffering events can produce multiple terabytes of logs per day.
- Analytics pipelines need to process this data in near real-time to update recommendations and dashboards.
API design
1. User APIs
Register User
POST /api/v1/users/register
Creates a new user account with email/password or social login.
Login
POST /api/v1/users/login
Authenticates user and returns access & refresh tokens.
Get User Profile
GET /api/v1/users/me
Returns the logged-in user's profile, settings, and preferences.
Update Profile
PUT /api/v1/users/me
Updates user details like name, language, playback preferences, etc.
2. Video APIs
Upload Video (YouTube-specific)
POST /api/v1/videos/upload
Allows a user to upload a new video file along with metadata (title, description, tags).
Get Video Metadata
GET /api/v1/videos/{videoId}
Fetches title, description, duration, view count, like count, etc.
List Videos
GET /api/v1/videos
Supports filters: category, trending, recent, subscriptions, etc.
Update Video Metadata
PUT /api/v1/videos/{videoId}
Updates the title, tags, description, or thumbnail.
Delete Video
DELETE /api/v1/videos/{videoId}
Removes a video from the platform (for uploader or admin).
3. Playback APIs
Get Streaming URL
GET /api/v1/videos/{videoId}/play
Returns the HLS/MPEG-DASH streaming manifest URL.
Resume Playback
GET /api/v1/users/{userId}/history/{videoId}
Returns last watched timestamp for resume support.
Save Playback Position
POST /api/v1/users/{userId}/history/{videoId}
Saves current position in video for resume functionality.
4. Interaction APIs
Like Video
POST /api/v1/videos/{videoId}/like
Adds a like by the current user.
Dislike Video
POST /api/v1/videos/{videoId}/dislike
Adds a dislike.
Comment on Video
POST /api/v1/videos/{videoId}/comments
Adds a new comment to the video.
Get Comments
GET /api/v1/videos/{videoId}/comments
Returns a paginated list of comments.
Subscribe to Channel (YouTube-specific)
POST /api/v1/channels/{channelId}/subscribe
Subscribes the user to a channel.
Watch History
GET /api/v1/users/{userId}/history
Returns a list of recently watched videos.
5. Recommendation & Discovery APIs
Get Recommendations
GET /api/v1/videos/recommendations
Returns a list of recommended videos for the user.
Search Videos
GET /api/v1/search?q=keyword
Searches for videos by title, tags, or channel.
6. Subscription & Billing APIs (Netflix-specific)
Subscribe to Plan
POST /api/v1/subscribe
User chooses a plan and triggers payment.
Get Current Subscription
GET /api/v1/subscribe
Returns current subscription details and status.
Cancel Subscription
DELETE /api/v1/subscribe
Cancels the user’s subscription.
7. Admin APIs
Approve or Remove Video
PUT /api/v1/admin/videos/{videoId}/approve
For moderating flagged videos.
Ban User
POST /api/v1/admin/users/{userId}/ban
Restricts user access due to policy violations.
View Platform Analytics
GET /api/v1/admin/analytics
Returns dashboards on usage, uploads, views, and revenue.
Database design
1. User Management
DB: Relational DB (PostgreSQL / MySQL)
1. Tradeoff:
- Strong consistency guarantees are needed for login, authentication, subscriptions, etc.
- Well-suited for structured, relational data like users, subscriptions, billing.
- Mature ecosystem, supports transactions.
2. Scalability:
- Can handle hundreds of millions of users with read replicas and sharding by user ID.
- Might require migration to a distributed SQL DB (e.g., CockroachDB or Vitess) if traffic becomes global and extreme.
3. Schema:
User Table
```sql
CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    password_hash TEXT NOT NULL,
    name VARCHAR(100),
    created_at TIMESTAMP,
    last_login TIMESTAMP
);
```
Profile Table (Netflix-style multiple profiles)
```sql
CREATE TABLE profiles (
    profile_id UUID PRIMARY KEY,
    user_id UUID REFERENCES users(user_id),
    profile_name VARCHAR(100),
    language VARCHAR(10),
    is_kid BOOLEAN
);
```
2. Video Metadata
DB: Relational DB (PostgreSQL) + Search Index (Elasticsearch)
1. Tradeoff:
- PostgreSQL is great for video metadata with clear relationships.
- Elasticsearch is used to support full-text search and filters (tags, titles, channels).
2. Scalability:
- Metadata alone isn’t heavy—PostgreSQL will scale with partitioning.
- Elasticsearch clusters scale horizontally for search.
3. Schema:
Videos
```sql
CREATE TABLE videos (
    video_id UUID PRIMARY KEY,
    uploader_id UUID REFERENCES users(user_id),
    title TEXT,
    description TEXT,
    category VARCHAR(50),
    visibility VARCHAR(20), -- public/private/unlisted
    upload_time TIMESTAMP,
    duration INT,
    status VARCHAR(20) -- processing, ready, failed
);
```
Tags
```sql
CREATE TABLE video_tags (
    video_id UUID REFERENCES videos(video_id),
    tag TEXT
);
```
3. Comments & Likes
DB: NoSQL (MongoDB / DynamoDB)
1. Tradeoff:
- High volume of writes and reads.
- Comments and likes don’t need strong joins or ACID transactions.
- Document structure fits well (nested replies, metadata).
2. Scalability:
- Easily scalable horizontally.
- DynamoDB with partition key = video_id spreads traffic across videos, though hot videos may still need key sharding (see failure scenarios below).
3. Schema:
Comments Document (MongoDB-style)
```json
{
  "_id": "comment_id",
  "video_id": "vid123",
  "user_id": "user456",
  "text": "Nice video!",
  "created_at": "2024-03-30T12:00:00Z",
  "replies": [
    { "user_id": "user789", "text": "Agreed!", "created_at": "..." }
  ]
}
```
Likes Table (for fast toggle)
- DynamoDB table with composite key `(video_id, user_id)`
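A minimal sketch of the toggle against such a table, assuming boto3 and a hypothetical table named `video_likes`; the conditional write is what makes repeated taps idempotent:

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("video_likes")  # hypothetical table name

def like(video_id: str, user_id: str) -> bool:
    """Record a like; returns False if the user had already liked the video."""
    try:
        table.put_item(
            Item={"video_id": video_id, "user_id": user_id},
            # Conditional write: fails instead of overwriting an existing like.
            ConditionExpression="attribute_not_exists(user_id)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise

def unlike(video_id: str, user_id: str) -> None:
    table.delete_item(Key={"video_id": video_id, "user_id": user_id})
```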
4. Watch History & Resume Playback
DB: Wide Column Store (Apache Cassandra / ScyllaDB)
1. Tradeoff:
- Append-heavy workload.
- Needs fast writes and predictable reads (per-user query).
- Wide column DBs are perfect for this time-series-like pattern.
2. Scalability:
- Proven to scale to billions of rows.
- Netflix uses Cassandra for exactly this use case.
3. Schema:
WatchHistory Table
```cql
CREATE TABLE watch_history (
    user_id UUID,
    video_id UUID,
    last_watched TIMESTAMP,
    position_seconds INT,
    -- video_id must be part of the key, otherwise two videos watched at
    -- the same instant collide on (user_id, last_watched)
    PRIMARY KEY ((user_id), last_watched, video_id)
) WITH CLUSTERING ORDER BY (last_watched DESC, video_id ASC);

-- Resume playback needs a direct lookup by (user_id, video_id), which the
-- table above cannot serve; a small companion table covers it:
CREATE TABLE resume_positions (
    user_id UUID,
    video_id UUID,
    position_seconds INT,
    last_watched TIMESTAMP,
    PRIMARY KEY ((user_id), video_id)
);
```
5. Subscriptions & Billing (Netflix)
DB: Relational DB (PostgreSQL)
1. Tradeoff:
- Financial data must be ACID-compliant.
- Supports relationships between plans, users, and transactions.
2. Scalability:
- Can be scaled using sharding per user ID and/or using distributed SQL solutions later.
3. Schema:
Subscriptions
```sql
CREATE TABLE subscriptions (
    subscription_id UUID PRIMARY KEY,
    user_id UUID REFERENCES users(user_id),
    plan_type VARCHAR(50),
    status VARCHAR(20),
    start_date DATE,
    end_date DATE
);
```
Payments
```sql
CREATE TABLE payments (
    payment_id UUID PRIMARY KEY,
    subscription_id UUID REFERENCES subscriptions(subscription_id),
    amount DECIMAL,
    status VARCHAR(20),
    transaction_date TIMESTAMP
);
```
6. Search & Recommendations
DB: Elasticsearch + Graph DB (optional)
1. Tradeoff:
- Elasticsearch for keyword/tag search.
- Graph DB (like Neo4j or AWS Neptune) useful for complex recommendation logic (users who watched X also watched Y).
2. Scalability:
- Elasticsearch horizontally scales via shards.
- For large graphs, use Graph DB + precomputed offline recommendation pipelines.
7. CDN & Video Files
Storage: Object Store (Amazon S3, Google Cloud Storage)
Edge Delivery: CDN (Cloudflare, Akamai, AWS CloudFront)
1. Tradeoff:
- Video content is best stored as blobs, not in a traditional DB.
- Object stores are highly durable (99.999999999%) and cost-effective.
- CDNs offload traffic and reduce latency.
2. Scalability:
- Practically infinite scaling for video files.
- CDN edge nodes can handle tens of Tbps.
8. Analytics, Logs, and Metrics
DB: Data Lake (S3 + Presto) + Time Series DB (Prometheus / InfluxDB)
1. Tradeoff:
- Raw logs go into object store, queried via Presto or Spark.
- Real-time metrics (CPU, memory, traffic) go into Prometheus.
- Useful for dashboards, alerts, usage patterns.
High-level design
1. API Gateway
Functionality:
- Acts as the entry point for all client requests (web, mobile, TV).
- Handles authentication, rate limiting, logging, and request routing.
- Directs requests to the appropriate backend service.
2. User Service
Functionality:
- Manages user registration, login, logout.
- Handles user profiles, preferences, and account settings.
- Issues and validates JWT tokens or OAuth tokens for secure access.
3. Video Upload/Ingestion Service
Functionality:
- Accepts uploaded videos from users or studios.
- Validates file type, size, and formats.
- Stores raw video temporarily in a staging bucket.
- Triggers the video encoding pipeline.
4. Video Encoding Service
Functionality:
- Transcodes videos into multiple resolutions and formats (e.g., 144p to 4K).
- Generates adaptive bitrate streams (HLS, MPEG-DASH).
- Stores processed videos in object storage (e.g., S3).
- Updates video metadata with streamable URLs and status.
5. Metadata Service
Functionality:
- Stores and serves video metadata (title, description, tags, duration).
- Provides APIs to list/search/filter videos.
- Synchronizes with Elasticsearch for full-text search support.
6. Playback Service
Functionality:
- Handles requests for video playback.
- Returns signed URLs or CDN paths for adaptive streaming.
- Stores playback progress per user (for resume support).
7. Comment & Like Service
Functionality:
- Manages likes, dislikes, and comments per video.
- Provides APIs for creating, retrieving, updating, and moderating comments.
- Supports nested replies and like/dislike counts.
8. Watch History Service
Functionality:
- Tracks which videos users have watched and when.
- Supports "Continue Watching" and personalized home page.
- Stores watch progress and timestamps.
9. Search & Recommendation Service
Functionality:
- Provides keyword-based search using Elasticsearch.
- Generates personalized recommendations using collaborative filtering or ML models.
- Ranks trending videos, new releases, and user-specific suggestions.
10. Subscription & Billing Service (Netflix-specific)
Functionality:
- Manages plans, subscriptions, and payment status.
- Integrates with payment gateways (Stripe, Razorpay, etc.)
- Handles access control to premium content based on active subscription.
11. Notification Service
Functionality:
- Sends in-app, email, or push notifications for new uploads, subscriptions, or recommendations.
- Integrates with messaging systems like Kafka or RabbitMQ.
12. Admin Service
Functionality:
- Allows internal users to manage users, videos, comments, and abuse reports.
- Approve or reject videos (if needed).
- View analytics dashboards.
13. Content Delivery Network (CDN)
Functionality:
- Caches and delivers video content from edge servers closer to users.
- Reduces latency and bandwidth cost.
- Supports high concurrency during peak traffic.
14. Analytics & Logging Pipeline
Functionality:
- Collects logs for video views, playback errors, buffering, engagement, etc.
- Streams data into a data lake (e.g., S3 + Presto).
- Supports dashboards, insights, ML training pipelines.
15. Database Cluster(s)
Functionality:
- Stores structured data (PostgreSQL for metadata, users, subscriptions).
- Stores unstructured or large data (MongoDB for comments, Cassandra for watch history).
- Distributed, replicated, and partitioned for scale.
16. Object Storage
Functionality:
- Stores original and encoded video files, thumbnails, and subtitle files.
- Provides durability and availability guarantees.
- Works with CDN for content delivery.
Request flows
1. User Signup/Login Flow
Scenario: User signs up or logs in
- Client (Web/Mobile) sends `POST /users/register` or `POST /users/login` to API Gateway.
- API Gateway forwards the request to User Service.
- User Service verifies credentials (or stores new user), hashes passwords, and stores info in User DB (PostgreSQL).
- On success, it generates a JWT access token and sends it back via API Gateway.
2. Video Upload Flow (YouTube-style)
Scenario: Creator uploads a video
- Client sends a `POST /videos/upload` request to API Gateway with the video file and metadata.
- API Gateway routes to Video Upload Service.
- Upload Service saves raw video to staging storage bucket (S3/Blob Storage).
- It then sends a message to a queue (Kafka/SQS) to trigger Video Encoding Service.
- Encoding Service reads video from staging, processes it into various resolutions, and stores output in object storage.
- On success, it updates Metadata Service with the encoded video URLs and playback readiness.
- Metadata Service stores metadata in PostgreSQL and indexes it in Elasticsearch.
3. Video Playback Flow
Scenario: User clicks “Play” on a video
- Client sends `GET /videos/{videoId}/play` to API Gateway.
- API Gateway calls Playback Service.
- Playback Service verifies user access (subscription status if Netflix), then retrieves the stream manifest URL (HLS/DASH) from Metadata DB.
- Returns signed URL or CDN path to client.
- Client player streams video directly from CDN using adaptive bitrate.
4. Search & Browse Flow
Scenario: User searches for a video
- Client sends `GET /search?q=query` to API Gateway.
- API Gateway routes to Search Service.
- Search Service queries Elasticsearch index for matching videos.
- Results (with partial metadata) are returned to client.
- For full video details (e.g., views, uploader name), frontend makes follow-up calls to Metadata Service.
5. Like / Comment Flow
Scenario: User likes or comments on a video
- Client sends `POST /videos/{videoId}/like` or `POST /videos/{videoId}/comments`.
- API Gateway routes to Like/Comment Service.
- The service writes to MongoDB / DynamoDB and optionally updates like/comment counters in Metadata DB.
- Counters can be asynchronously updated to avoid write contention.
6. Watch History Flow
Scenario: User watches a video partially
- Client periodically sends `POST /users/{userId}/history/{videoId}` with the current playback time.
- API Gateway routes to Watch History Service.
- The service writes a record to Cassandra (user ID + timestamp + position).
- When the user returns, a `GET` request to the same service fetches the last playback time to resume.
7. Recommendation Flow
Scenario: User opens homepage
- Client sends `GET /videos/recommendations` to API Gateway.
- Gateway calls Recommendation Service.
- Service queries pre-computed results from:
- Cache (Redis)
- Graph DB (optional)
- or from ML model inference via internal APIs
- Response includes a curated list of videos with basic metadata.
- Client fetches detailed metadata by calling Metadata Service.
8. Subscription & Payment Flow (Netflix)
Scenario: User subscribes to a plan
- Client sends `POST /subscribe` to API Gateway.
- API Gateway forwards to Subscription Service.
- Subscription Service invokes Payment Gateway API (e.g., Stripe).
- On success, subscription is recorded in PostgreSQL, and access is updated in Access Control DB.
- Future playback requests will validate this subscription before serving content.
9. CDN & Video Delivery Flow
Scenario: Content is streamed from nearest edge
- After retrieving the streaming manifest from Playback Service, the video is played from CDN edge node.
- If video is not cached, CDN pulls from origin (object store like S3) and then caches it.
10. Admin Moderation Flow
Scenario: Admin removes flagged video
- Admin sends `PUT /admin/videos/{videoId}/remove` via an internal dashboard.
- Admin Service verifies access and updates the video status in Metadata DB.
- If needed, encoding artifacts can be deleted from storage.
- Changes are propagated to cache and search index.
Detailed component design
1. API Gateway
1. How it handles requirements
Functional:
- Routes requests to appropriate microservices via routing tables.
- Handles authentication (JWT/OAuth token parsing and validation).
- Implements rate limiting using token buckets or leaky bucket algorithms.
- Supports protocol transformations (e.g., HTTP/2 to HTTP/1.1).
Non-functional:
- Scalability: Horizontally scalable stateless service.
- Security: Enforces HTTPS, CORS policies, input validation.
- Observability: Logs all requests, integrates with tracing systems like OpenTelemetry.
2. Interactions
- Forwards requests to downstream services via REST/gRPC.
- Reads service discovery info via Consul/Eureka or API configs via Kong/Apigee/Nginx.
3. Algorithms / Data Structures
- Routing via prefix trees or hash maps (path → service)
- Token Bucket for rate limiting
- LRU Caches for JWT token introspection and service discovery TTLs
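A minimal in-process token bucket illustrating the rate-limiting point above; this is a sketch only, since a production gateway would typically keep the counters in Redis or in the proxy layer itself:

```python
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate=100, capacity=200)   # 100 req/s, bursts up to 200
if not bucket.allow():
    print("429 Too Many Requests")
```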
2. User Service
1. How it handles requirements
Functional:
- Manages account creation, login, logout, profile settings.
- Authenticates users and issues JWT tokens.
- Supports email verification and password hashing (using bcrypt/scrypt).
Non-functional:
- Security: Stores passwords securely (bcrypt), uses HTTPS, and CSRF tokens for forms.
- Availability: Deploys in multi-region, uses database replicas.
- Performance: Caches frequently accessed user profiles using Redis.
2. Interactions
- Communicates with PostgreSQL to read/write user info.
- Sends messages to Notification Service for email/SMS.
- Interacts with Auth service to validate tokens (if decoupled).
3. Algorithms / Data Structures
- Hash maps in Redis for user profile cache
- Bloom filters to check username/email uniqueness before DB hit
- Secure password hashing algorithms (bcrypt, Argon2)
3. Video Upload/Ingestion Service
1. How it handles requirements
Functional:
- Accepts video files via multipart uploads or resumable uploads (Tus protocol).
- Validates file formats and stores in staging object storage.
- Publishes a job to a message queue (Kafka/SQS) for encoding.
Non-functional:
- Scalability: Upload service is stateless and horizontally scalable.
- Fault Tolerance: Resumable upload protocols ensure incomplete files aren’t lost.
- Durability: Relies on S3/Blob Storage for persistent raw file storage.
2. Interactions
- Writes to object storage (e.g., S3).
- Publishes to Kafka → `video_ingest_topic` for the encoding pipeline.
- Updates Metadata Service via REST/gRPC once the video is ready.
3. Algorithms / Data Structures
- Chunked upload buffering (Ring buffer / Sliding window)
- MD5 hash checks to verify upload integrity
- Retry queues with exponential backoff for failure handling
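A sketch of the integrity check during chunk assembly, assuming the client sends an MD5 of the full file up front (names are illustrative, not a specific resumable-upload protocol):

```python
import hashlib
from typing import Iterable

def assemble_and_verify(chunks: Iterable[bytes], expected_md5: str) -> bytes:
    """Concatenate uploaded chunks and verify the whole-file checksum."""
    digest = hashlib.md5()
    out = bytearray()
    for chunk in chunks:
        digest.update(chunk)
        out.extend(chunk)
    if digest.hexdigest() != expected_md5:
        # Corrupted upload: reject so the client can retry the bad range.
        raise ValueError("checksum mismatch")
    return bytes(out)
```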
4. Video Encoding Service
1. How it handles requirements
Functional:
- Consumes video files, runs FFmpeg to generate HLS/DASH formats.
- Generates thumbnails and captions (auto-transcription using speech-to-text).
- Updates Metadata DB with encoding status.
Non-functional:
- Performance: Parallel encoding pipelines using containerized workers (K8s + FFmpeg).
- Scalability: Each worker pod can process a queue of tasks independently.
- Reliability: Encodes idempotently; failed jobs are retried.
2. Interactions
- Reads from object storage.
- Publishes status to Kafka topic → `video_encoded_topic`.
- Calls Metadata Service and stores processed file URLs.
3. Algorithms / Data Structures
- Priority queues for encoding jobs (based on popularity or VIP)
- FFmpeg-based codecs and bitrate ladder generation
- Scene change detection algorithms to pick thumbnails
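A sketch of a bitrate-ladder HLS encode driven through FFmpeg from a worker; the ladder values here are assumptions, and real ladders are tuned per title:

```python
import subprocess

LADDER = [  # (output height, video bitrate) -- illustrative values
    (240, "400k"), (480, "1200k"), (720, "3000k"), (1080, "5000k"),
]

def encode_hls(src: str) -> None:
    for height, bitrate in LADDER:
        subprocess.run([
            "ffmpeg", "-y", "-i", src,
            "-vf", f"scale=-2:{height}",       # keep aspect ratio
            "-c:v", "libx264", "-b:v", bitrate,
            "-c:a", "aac", "-b:a", "128k",
            "-hls_time", "6",                  # 6-second segments
            "-hls_playlist_type", "vod",
            f"out_{height}p.m3u8",
        ], check=True)
```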
5. Metadata Service
1. How it handles requirements
Functional:
- Stores title, description, tags, upload status, visibility.
- Supports update, fetch, and list APIs for metadata.
- Synchronizes searchable data with Elasticsearch.
Non-functional:
- Performance: Heavily read-optimized with caching layers (Redis).
- Availability: Deployed with read replicas and load-balanced.
- Consistency: Uses DB transactions to ensure accurate metadata state.
2. Interactions
- PostgreSQL or MySQL for persistent metadata.
- Elasticsearch for search indexing.
- Talks to Comment Service, Like Service, and Recommendation Service for composite views.
3. Algorithms / Data Structures
- Inverted index in Elasticsearch for tag/title search.
- Redis caching with TTL and LRU for top video metadata.
- Background sync pipelines for reindexing and bulk update jobs.
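The Redis caching described above is the classic cache-aside read. A minimal sketch, with `load_from_db` as a stand-in stub for the real PostgreSQL query:

```python
import json
import redis

r = redis.Redis()
TTL_SECONDS = 60

def load_from_db(video_id: str) -> dict:
    # Stub standing in for the real PostgreSQL lookup.
    return {"video_id": video_id, "title": "stub"}

def get_video_metadata(video_id: str) -> dict:
    key = f"video:meta:{video_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)            # cache hit
    meta = load_from_db(video_id)            # cache miss: go to the DB
    r.set(key, json.dumps(meta), ex=TTL_SECONDS)
    return meta
```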
6. Playback Service
1. How it handles requirements
Functional:
- Validates access to video.
- Returns the manifest URL for playback (e.g., `.m3u8` for HLS).
- Logs watch-start events.
Non-functional:
- Security: Generates signed expirable URLs.
- Low Latency: Returns CDN path quickly via pre-signed URL or token.
- Scalability: Stateless, cache pre-generated manifest paths in Redis.
2. Interactions
- Metadata Service for playback readiness.
- CDN and Object Storage for streaming URLs.
- Watch History Service to record progress.
3. Algorithms / Data Structures
- Access token generation using HMAC/SHA256
- Time-based token invalidation (e.g., `exp` claim in JWT or a signed query param)
- Ring buffer or message queue for live video buffering (if supporting live streams)
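A minimal sketch of the HMAC-signed, expiring URL described above; key handling and the exact URL layout are assumptions:

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"rotate-me"  # would come from a secrets manager in practice

def sign_url(path: str, ttl_seconds: int = 300) -> str:
    expires = int(time.time()) + ttl_seconds
    msg = f"{path}?expires={expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{path}?{urlencode({'expires': expires, 'sig': sig})}"

def verify(path: str, expires: int, sig: str) -> bool:
    if time.time() > expires:
        return False                           # token expired
    msg = f"{path}?expires={expires}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)  # constant-time compare

print(sign_url("/videos/vid123/master.m3u8"))
```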
7. Comment & Like Service
1. How it handles requirements
Functional:
- Allows users to add/edit/delete comments.
- Tracks likes and dislikes.
- Supports threading and replies.
Non-functional:
- High Throughput: NoSQL (MongoDB/DynamoDB) ensures horizontal write scalability.
- Consistency: Eventual consistency is acceptable.
- Caching: Top N comments cached in Redis.
2. Interactions
- MongoDB/DynamoDB for storage.
- Metadata Service to update like counts asynchronously.
- Notification Service for mention/tag alerts.
3. Algorithms / Data Structures
- Nested documents for threaded comments
- Counters using Redis INCR or DynamoDB atomic counters
- Pagination with cursor-based scrolling for performance
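A sketch of cursor-based comment pagination with pymongo, assuming the document shape from the database section; `_id` doubles as the cursor since ObjectIds sort roughly by creation time:

```python
from pymongo import DESCENDING, MongoClient

comments = MongoClient()["videoapp"]["comments"]  # hypothetical names

def page_comments(video_id: str, cursor=None, limit: int = 20):
    query = {"video_id": video_id}
    if cursor is not None:
        query["_id"] = {"$lt": cursor}   # resume strictly after last seen
    batch = list(
        comments.find(query).sort("_id", DESCENDING).limit(limit)
    )
    next_cursor = batch[-1]["_id"] if len(batch) == limit else None
    return batch, next_cursor
```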
8. Watch History Service
1. How it handles requirements
Functional:
- Records which video user watched, at what time, and how much.
- Supports resume functionality.
Non-functional:
- High Write Volume: Uses Cassandra for write-optimized time-series-like data.
- Availability: Highly replicated.
- Low Latency Reads: Indexed by user ID with clustering by timestamp.
2. Interactions
- Receives POSTs from clients during playback.
- Called by Playback Service to resume.
- Feeds Recommendation Service with data for personalization.
3. Algorithms / Data Structures
- Wide column structure: rows per user, columns per video
- Bloom filters to skip reads on missing entries
- TTL-based data expiry to remove old records
9. Recommendation Service
1. How it handles requirements
Functional:
- Personalizes homepage and suggestions.
- Uses collaborative filtering, content-based filtering, or deep learning models.
Non-functional:
- Scalability: Pre-computed results are cached (e.g., Redis or Materialized Views).
- Latency: Serve results in <100ms from cache or inference engine.
2. Interactions
- Metadata Service for video info.
- Watch History, Like, Comment Services as input signals.
- ML pipelines fetch data from Data Lake for model training.
3. Algorithms / Data Structures
- Matrix factorization (ALS), KNN, Word2Vec for embeddings
- Graph traversal (for “people who watched this also watched”)
- Real-time ranking using features: recency, CTR, completion rate
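As a toy version of the graph-traversal idea ("people who watched this also watched"), an item-to-item co-occurrence count over watch sessions; production systems would use ALS or learned embeddings instead, and the sessions here are made up:

```python
from collections import Counter, defaultdict
from itertools import combinations

watch_sessions = [           # hypothetical per-user watch sets
    {"v1", "v2", "v3"},
    {"v1", "v3"},
    {"v2", "v3", "v4"},
]

# Count how often each pair of videos appears in the same session.
co_counts: dict[str, Counter] = defaultdict(Counter)
for session in watch_sessions:
    for a, b in combinations(sorted(session), 2):
        co_counts[a][b] += 1
        co_counts[b][a] += 1

def also_watched(video_id: str, k: int = 3) -> list[str]:
    return [v for v, _ in co_counts[video_id].most_common(k)]

print(also_watched("v1"))    # ['v3', 'v2']
```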
10. Subscription & Billing Service
1. How it handles requirements
Functional:
- Manages plans, trial status, and billing cycles.
- Integrates with Stripe/PayPal etc.
Non-functional:
- Strong consistency for financial transactions.
- Idempotent APIs to avoid duplicate billing.
2. Interactions
- PostgreSQL for subscriptions/payments.
- Calls external payment gateway APIs.
- Sets access control flags used by Playback Service.
3. Algorithms / Data Structures
- Event-sourcing or outbox pattern for reliable payment state changes
- Cron-based batch jobs for monthly billing
- Time-series tables for usage metering
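A sketch of the idempotency idea: every retry of a charge carries the same key, so a duplicate attempt replays the stored result instead of billing twice. The in-memory dict stands in for a persistent outbox table, and the gateway call is a placeholder, not a real SDK:

```python
_processed: dict[str, dict] = {}   # stand-in for a persistent outbox table

def charge(subscription_id: str, amount_cents: int, idempotency_key: str) -> dict:
    if idempotency_key in _processed:
        return _processed[idempotency_key]   # duplicate: replay stored result
    # A real implementation would call the payment gateway here and persist
    # the outcome in the same transaction as the outbox row.
    result = {"status": "succeeded", "amount": amount_cents}
    _processed[idempotency_key] = result
    return result

key = "sub_123:2024-03"            # one key per subscription per billing cycle
charge("sub_123", 999, key)
charge("sub_123", 999, key)        # retried call: no double charge
```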
Trade offs/Tech choices
1. Microservices Architecture
✅ Chosen: Microservices
- Why: Enables modularity, independent scaling, tech heterogeneity, better fault isolation.
- Trade-off: Increased operational complexity (network calls, service discovery, distributed tracing, eventual consistency).
❌ Alternative: Monolith
- Easier to build and deploy initially, but would not scale with billions of requests and services with independent load patterns (e.g., comment service vs. encoding service).
2. PostgreSQL / Relational DB for Core Metadata & User Info
✅ Chosen: PostgreSQL for structured data
- Why: ACID compliance, rich query language, integrity constraints, easy indexing for video metadata and user relations.
- Trade-off: Sharding and replication are harder than NoSQL, but manageable with tools like Citus, Vitess, or CockroachDB.
❌ Alternative: NoSQL (e.g., MongoDB)
- Would compromise on ACID guarantees for user/account data.
- Relational schema is more appropriate for subscriptions and foreign-key-heavy data.
3. NoSQL (MongoDB / DynamoDB) for Comments & Likes
✅ Chosen: MongoDB/DynamoDB
- Why: Schema-less, high write throughput, easy to scale horizontally. Perfect for comment documents and atomic counters for likes.
- Trade-off: Joins are difficult. Cross-entity consistency is not guaranteed.
❌ Alternative: Relational DB
- Would struggle with performance under massive comment volume (e.g., 100B+ rows).
- Writing deeply nested replies or bulk likes would be inefficient.
4. Wide Column DB (Cassandra) for Watch History
✅ Chosen: Cassandra
- Why: Write-optimized, high availability, linear horizontal scaling. Perfect for time-series style workloads.
- Trade-off: Consistency is tunable but not strong by default. Requires careful schema design (denormalization, partition keys).
❌ Alternative: Relational DB
- Would choke under the write volume from billions of watch records/day.
- Indexing on timestamp per user would become a bottleneck.
5. Object Storage (S3/GCS) + CDN for Video Delivery
✅ Chosen: Object Store + CDN
- Why: Cost-effective, durable (11 9s), infinite scalability. CDNs improve latency and bandwidth efficiency.
- Trade-off: Cannot directly stream from DB. Requires chunked and signed URL delivery with cache invalidation strategies.
❌ Alternative: Storing videos in file systems or databases
- Impractical. Increases cost, complexity, and reduces throughput compared to object storage.
6. Elasticsearch for Search
✅ Chosen: Elasticsearch
- Why: Powerful full-text search engine with tokenization, relevance ranking, and filtering. Scales horizontally.
- Trade-off: Requires sync mechanism with metadata DB. Writes are expensive and not ACID.
❌ Alternative: SQL full-text search
- Not scalable. Slower and less relevant ranking at large scale.
7. Redis for Caching Hot Data
✅ Chosen: Redis
- Why: Sub-millisecond latency for read-heavy operations (e.g., hot videos, top comments, playback positions).
- Trade-off: Data in Redis is volatile and must be backed by persistent stores.
❌ Alternative: Memcached
- No persistence, limited data structure support compared to Redis.
8. Kafka for Asynchronous Processing
✅ Chosen: Kafka
- Why: Distributed, fault-tolerant pub/sub system. Ideal for encoding jobs, analytics logging, notifications.
- Trade-off: Requires ops effort to manage. Ordering within partitions but not across.
❌ Alternative: Direct HTTP sync calls or RabbitMQ
- RabbitMQ is good for traditional queue semantics but doesn’t scale as well for logs or broadcast to many consumers.
9. FFmpeg for Encoding
✅ Chosen: FFmpeg (wrapped in worker containers)
- Why: Open-source, highly customizable for video encoding/transcoding workflows.
- Trade-off: Needs wrapper scripts and scaling via orchestration (e.g., Kubernetes jobs or batch processing).
❌ Alternative: Cloud-based encoding services
- Easier to integrate but expensive and less flexible for large-scale systems like YouTube.
10. Token-based Auth (JWT)
✅ Chosen: JWT
- Why: Stateless authentication for horizontal scaling of all stateless services.
- Trade-off: Revocation is tricky. Needs short TTLs and refresh tokens.
❌ Alternative: Session-based auth
- Centralized session store becomes a bottleneck in globally distributed systems.
Failure scenarios/bottlenecks & Mitigations
🔧 1. API Gateway Failures
Potential Failures:
- Gateway crash / instance down
- Overload / DDoS
- Misconfigured routing rules
- Token validation slowness
Bottlenecks:
- TLS handshake overhead under massive traffic
- JWT validation becoming CPU-bound
Mitigations:
- Deploy behind a load balancer with auto-scaling
- Use a fast in-memory JWT validation cache (Redis)
- Implement global rate limiting (leaky bucket/token bucket)
- Use WAF to filter abusive IPs
📦 2. Upload / Ingestion Failures
Potential Failures:
- Large file uploads time out
- Client loses connection midway
- File corruption on upload
- Queue or worker crash before ingestion
Bottlenecks:
- File validation becoming slow for massive uploads
- Sequential processing of uploads limiting throughput
Mitigations:
- Use resumable chunked uploads (e.g., TUS protocol)
- Validate file type/length on client before upload
- Store uploads in durable object store immediately
- Use a persistent queue (Kafka) to decouple processing
- Ensure idempotent workers (retry-safe ingestion)
🎞 3. Video Encoding Failures
Potential Failures:
- Worker node crash during encoding
- Out-of-memory errors during 4K processing
- Encoding backlog from too many uploads
- Incorrect format/codec handling
Bottlenecks:
- FFmpeg CPU-bound under high resolution
- IO bottlenecks reading large files from storage
Mitigations:
- Use containerized workers (Kubernetes Jobs) with memory limits
- Auto-scale encoding worker pools
- Retry failed jobs with backoff
- Pre-check input format before encoding
- Use job priority queue for trending/high-priority content
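A small retry helper with exponential backoff and full jitter, of the kind the encoding workers above would wrap around a job attempt; the parameters are illustrative:

```python
import random
import time

def retry(fn, attempts: int = 5, base: float = 1.0, cap: float = 60.0):
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise                       # out of retries: surface the error
            delay = min(cap, base * 2 ** attempt)
            time.sleep(random.uniform(0, delay))   # full jitter
```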
📂 4. Metadata Service Failures
Potential Failures:
- DB outage or corruption
- High read/write contention
- Search index out of sync
- Metadata cache inconsistency
Bottlenecks:
- PostgreSQL row-level locking on heavy concurrent writes
- Slow joins across video, tags, and uploader tables
Mitigations:
- Read replicas for scale-out
- Use connection poolers (e.g., PgBouncer)
- Separate read and write models (CQRS pattern)
- Async indexing pipeline for Elasticsearch
- Cache top video metadata in Redis with short TTLs
🎥 5. Playback Flow Failures
Potential Failures:
- Invalid or expired playback token
- CDN cache miss or region latency
- Wrong manifest returned (bad encoding metadata)
Bottlenecks:
- Redis token cache becomes single point of failure
- CDN origin throttle when too many videos are cold
Mitigations:
- Generate signed URLs with short expiry for security
- Pre-warm CDN cache for popular content
- Use multi-CDN fallback (e.g., CloudFront + Akamai)
- Store video metadata with redundancy
💬 6. Comments & Likes Failures
Potential Failures:
- Hot video leads to write hotspot
- User spams like/dislike rapidly
- Inconsistent comment count due to async processing
Bottlenecks:
- MongoDB primary becoming write bound
- DynamoDB partition throttling due to uneven traffic
Mitigations:
- Use sharding or partition keys (e.g., video_id%N)
- Apply write throttling / debounce logic on client side
- Eventually update comment counters asynchronously
- Limit reply depth or flatten very deep threads
👁 7. Watch History Failures
Potential Failures:
- Inconsistent playback resume point
- High write throughput causes Cassandra slowdown
- Old history not cleaned up → storage bloat
Bottlenecks:
- Time-series writes becoming IO-bound
- Querying large partitions for binge watchers
Mitigations:
- TTL to expire history beyond N days
- Use clustering order DESC for most recent entries
- Write throttling from clients (batch resume sync every 30s)
- Use bounded partitions (e.g., per month per user)
🔍 8. Search & Recommendation Failures
Potential Failures:
- Elasticsearch indexing lag
- Missing videos in search due to partial index failure
- Stale recommendations
Bottlenecks:
- Elasticsearch full cluster GC pauses
- ML model prediction latency at inference time
Mitigations:
- Use bulk async indexers with retry and checkpointing
- Monitor indexing lag and alert on discrepancy
- Cache home page recommendations in Redis per user
- Use ML inference caching for most frequent queries
💳 9. Subscription & Billing Failures
Potential Failures:
- Payment gateway timeout or failure
- Incorrect subscription tier validation
- Double charges
Bottlenecks:
- High write volume at billing cycle (e.g., 1st of the month)
- High latency if integrated via sync payment APIs
Mitigations:
- Use idempotent billing APIs with transaction IDs
- Async webhooks from Stripe → write to outbox/event log
- Flag users as `pending_payment` until confirmation
- Run billing in scheduled batches
📡 10. CDN and Video Delivery Failures
Potential Failures:
- CDN cache miss = slow load
- Origin fetch throttling
- Geo-restriction misconfigured
Bottlenecks:
- Cold start latency on un-cached videos
- CDN edge saturation in certain regions
Mitigations:
- Pre-warm top videos before release (scheduled cache prefill)
- Serve fallback bitrate while higher resolution loads
- Use signed CDN URLs with geo-policy enforcement
📈 11. Logging, Monitoring, and Observability Failures
Potential Failures:
- Missing logs due to buffer loss
- Slow dashboard queries (e.g., Prometheus / Grafana)
- Unnoticed system-wide slowness
Bottlenecks:
- Too many metrics → cardinality explosion
- Unbounded log ingestion → storage cost + delays
Mitigations:
- Use structured logging + log levels per service
- Push logs through Kafka + FluentD to a central store (e.g., S3, Loki)
- Pre-aggregate metrics and alerts (e.g., p99 latency)
- Use circuit breakers + alerting (e.g., via Prometheus + AlertManager)
🔄 General Cross-Service Bottlenecks
Scenarios:
- N+1 service calls
- Global mutex behaviour (e.g., updating trending counts)
- Thundering herd problem (cache miss for trending video)
Mitigations:
- Use batch APIs or data loaders
- Asynchronous eventual counters for analytics
- Employ request coalescing or lock-striping for high-contention updates
- Use cache stampede protection (e.g., request deduplication with singleflight pattern)
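A minimal sketch of the singleflight pattern from the last bullet: concurrent cache misses for the same key share one backend call instead of stampeding the origin. Error propagation to followers and result expiry are deliberately elided:

```python
import threading

class SingleFlight:
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight: dict[str, threading.Event] = {}
        self._results: dict[str, object] = {}

    def do(self, key: str, fn):
        with self._lock:
            event = self._inflight.get(key)
            if event is None:
                event = threading.Event()      # we become the leader
                self._inflight[key] = event
                leader = True
            else:
                leader = False                 # someone is already fetching
        if leader:
            try:
                self._results[key] = fn()      # the single backend call
            finally:
                with self._lock:
                    del self._inflight[key]
                event.set()                    # wake all waiting followers
            return self._results[key]
        event.wait()
        return self._results.get(key)          # reuse the leader's result

flight = SingleFlight()
value = flight.do("video:trending", lambda: "expensive DB read")
```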