My Solution for Design YouTube or Netflix with Score: 8/10
by iridescent_luminous693
System requirements
1. Functional Requirements (FR)
- User Management
- Sign up, login, logout
- Multiple profiles per account (Netflix)
- Video Uploading / Ingestion
- Users upload videos (YouTube)
- Ingestion pipeline from studios or content providers (Netflix)
- Add metadata: title, description, tags, etc.
- Generate thumbnails
- Video Encoding & Transcoding
- Convert to multiple resolutions and formats
- Generate adaptive bitrate streams (HLS, MPEG-DASH)
- Video Streaming
- Stream videos efficiently based on user bandwidth
- Support resume from last watch position
- Content Discovery
- Search by title, genre, actor, etc.
- Sort and filter results
- Categories like trending, new releases, etc.
- Recommendations
- Personalized video suggestions
- Continue watching list
- Auto-play next episode
- User Interactions (YouTube)
- Like, dislike, comment on videos
- Subscribe to channels
- Create playlists
- Watch History
- Maintain user's playback history
- Show “recently watched” or “watch again”
- Subscriptions & Payments (Netflix)
- Choose and manage subscription plans
- Integration with payment gateways
- Access control based on active subscription
- Admin Features
- Upload/manage platform content
- Ban/report content or users
- Manage payments and user analytics
2. Non-Functional Requirements (NFR)
- Scalability
- Handle millions of concurrent users and videos
- Auto-scale storage and compute resources
- High Availability
- Minimal downtime
- Replication across multiple regions
- Performance
- Fast load and buffer times
- Low-latency streaming
- Security
- Encrypted user data and video content
- Digital Rights Management (DRM) for premium content
- Prevent piracy and unauthorized sharing
- Reliability
- Handle failures gracefully
- Retry mechanisms and redundancy
- Maintainability
- Clean architecture and modular design
- Easy to patch and update services
- Extensibility
- Add support for new content types (e.g., live streaming)
- Easily integrate third-party APIs (e.g., ad services)
- Monitoring & Logging
- Real-time analytics on views, errors, system health
- Log user activity for debugging and personalization
- SEO Optimization (YouTube)
- Optimize metadata and video pages for search engines
- Compliance
- GDPR, COPPA, or other legal regulations depending on geography
Capacity estimation
1. User Base and Traffic
For YouTube-scale:
- Around 2.5 billion monthly active users.
- Roughly 1 billion users use the platform daily.
- Each user might watch 60–90 minutes of video per day.
- Peak concurrent viewers can go up to 10–20 million.
For Netflix-scale:
- Around 250 million monthly active users.
- Roughly 100 million use it daily.
- Each user might stream 90–120 minutes per day.
- Peak concurrency is usually around 5–10 million users.
2. Video Upload and Storage (mostly applies to YouTube)
- Around 500 hours of video are uploaded every minute.
- That totals 720,000 hours of video per day.
- If 1 hour of 1080p video takes about 2.25 GB, the daily storage requirement is roughly 1.6 petabytes.
- Monthly, this comes to about 50 petabytes of new video being added.
3. Transcoding and Video Variants
- Each uploaded video is transcoded into multiple formats and resolutions (e.g., 144p, 240p, 360p, 720p, 1080p, 4K).
- On average, there are 5–10 variants for each video.
- So the total storage consumption due to transcoding increases by a factor of 5 to 10.
- This means monthly storage growth is around 250 to 500 petabytes once variants are included — several exabytes per year.
4. Video Streaming and Bandwidth
For Netflix:
- Assume 100 million users stream for about 2 hours a day.
- At 1080p (around 5 Mbps), the total daily data transfer is close to 450 petabytes.
- At peak time, with 20 million concurrent users at 5 Mbps, you’d need to serve around 100 terabits per second.
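These estimates are straightforward to sanity-check. A back-of-envelope sketch in Python, using only the assumptions stated in this section (2.25 GB per 1080p hour, 5 Mbps streams), reproduces the storage and bandwidth figures:

```python
# Back-of-envelope check of the capacity numbers above.
UPLOAD_HOURS_PER_MIN = 500        # YouTube-scale ingest
GB_PER_1080P_HOUR = 2.25          # assumed size of one hour at 1080p
VARIANT_FACTOR = (5, 10)          # transcoded renditions per video

daily_hours = UPLOAD_HOURS_PER_MIN * 60 * 24              # 720,000 h/day
daily_raw_pb = daily_hours * GB_PER_1080P_HOUR / 1e6      # ~1.6 PB/day
monthly_raw_pb = daily_raw_pb * 30                        # ~49 PB/month
monthly_with_variants = [monthly_raw_pb * f for f in VARIANT_FACTOR]

STREAM_MBPS = 5                   # 1080p bitrate
viewers, hours = 100e6, 2         # Netflix-scale daily viewing
egress_bytes = viewers * hours * 3600 * STREAM_MBPS * 1e6 / 8
peak_tbps = 20e6 * STREAM_MBPS / 1e6                      # 20M concurrent streams

print(f"daily raw ingest: {daily_raw_pb:.1f} PB")
print(f"monthly incl. variants: {monthly_with_variants[0]:.0f}-{monthly_with_variants[1]:.0f} PB")
print(f"daily egress: {egress_bytes / 1e15:.0f} PB, peak {peak_tbps:.0f} Tbps")
```

Running this prints ~1.6 PB/day of raw ingest, ~243–486 PB/month with variants, ~450 PB/day of streaming egress, and a 100 Tbps peak.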
5. Metadata and Comment Storage
- Each video might have about 10 KB of metadata.
- For 1 billion videos, that’s around 10 terabytes of metadata.
- If each video has 100 comments on average, that’s about 100 billion rows in the comments table.
- Likes, views, and playlists scale into trillions of records.
6. Watch History
- Assuming 1 billion users and 1000 watch history entries per user, you'd end up with around 1 trillion rows in your watch history system.
7. Caching and CDN Load
- Most watched videos are cached at the edge (CDN servers).
- Just the top 10,000 videos can serve over 90% of the traffic.
- Each CDN edge location may need around 1 to 2 terabytes of hot content to be cached.
8. Logs and Analytics
- Logging user behavior, video starts/stops, errors, and buffering events can produce multiple terabytes of logs per day.
- Analytics pipelines need to process this data in near real-time to update recommendations and dashboards.
API design
1. User APIs
Register User
POST /api/v1/users/register
Creates a new user account with email/password or social login.
Login
POST /api/v1/users/login
Authenticates user and returns access & refresh tokens.
Get User Profile
GET /api/v1/users/me
Returns the logged-in user's profile, settings, and preferences.
Update Profile
PUT /api/v1/users/me
Updates user details like name, language, playback preferences, etc.
2. Video APIs
Upload Video (YouTube-specific)
POST /api/v1/videos/upload
Allows a user to upload a new video file along with metadata (title, description, tags).
Get Video Metadata
GET /api/v1/videos/{videoId}
Fetches title, description, duration, view count, like count, etc.
List Videos
GET /api/v1/videos
Supports filters: category, trending, recent, subscriptions, etc.
Update Video Metadata
PUT /api/v1/videos/{videoId}
Updates the title, tags, description, or thumbnail.
Delete Video
DELETE /api/v1/videos/{videoId}
Removes a video from the platform (for uploader or admin).
3. Playback APIs
Get Streaming URL
GET /api/v1/videos/{videoId}/play
Returns the HLS/MPEG-DASH streaming manifest URL.
Resume Playback
GET /api/v1/users/{userId}/history/{videoId}
Returns last watched timestamp for resume support.
Save Playback Position
POST /api/v1/users/{userId}/history/{videoId}
Saves current position in video for resume functionality.
4. Interaction APIs
Like Video
POST /api/v1/videos/{videoId}/like
Adds a like by the current user.
Dislike Video
POST /api/v1/videos/{videoId}/dislike
Adds a dislike.
Comment on Video
POST /api/v1/videos/{videoId}/comments
Adds a new comment to the video.
Get Comments
GET /api/v1/videos/{videoId}/comments
Returns a paginated list of comments.
Subscribe to Channel (YouTube-specific)
POST /api/v1/channels/{channelId}/subscribe
Subscribes the user to a channel.
Watch History
GET /api/v1/users/{userId}/history
Returns a list of recently watched videos.
5. Recommendation & Discovery APIs
Get Recommendations
GET /api/v1/videos/recommendations
Returns a list of recommended videos for the user.
Search Videos
GET /api/v1/search?q=keyword
Searches for videos by title, tags, or channel.
6. Subscription & Billing APIs (Netflix-specific)
Subscribe to Plan
POST /api/v1/subscribe
User chooses a plan and triggers payment.
Get Current Subscription
GET /api/v1/subscribe
Returns current subscription details and status.
Cancel Subscription
DELETE /api/v1/subscribe
Cancels the user’s subscription.
7. Admin APIs
Approve or Remove Video
PUT /api/v1/admin/videos/{videoId}/approve
For moderating flagged videos.
Ban User
POST /api/v1/admin/users/{userId}/ban
Restricts user access due to policy violations.
View Platform Analytics
GET /api/v1/admin/analytics
Returns dashboards on usage, uploads, views, and revenue.
Database design
1. User Management
DB: Relational DB (PostgreSQL / MySQL)
1. Tradeoff:
- Strong consistency guarantees are needed for login, authentication, subscriptions, etc.
- Well-suited for structured, relational data like users, subscriptions, billing.
- Mature ecosystem, supports transactions.
2. Scalability:
- Can handle hundreds of millions of users with read replicas and sharding by user ID.
- Might require migration to a distributed SQL DB (e.g., CockroachDB or Vitess) if traffic becomes global and extreme.
3. Schema:
User Table
```sql
CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    password_hash TEXT NOT NULL,
    name VARCHAR(100),
    created_at TIMESTAMP,
    last_login TIMESTAMP
);
```
Profile Table (Netflix-style multiple profiles)
```sql
CREATE TABLE profiles (
    profile_id UUID PRIMARY KEY,
    user_id UUID REFERENCES users(user_id),
    profile_name VARCHAR(100),
    language VARCHAR(10),
    is_kid BOOLEAN
);
```
2. Video Metadata
DB: Relational DB (PostgreSQL) + Search Index (Elasticsearch)
1. Tradeoff:
- PostgreSQL is great for video metadata with clear relationships.
- Elasticsearch is used to support full-text search and filters (tags, titles, channels).
2. Scalability:
- Metadata alone isn’t heavy—PostgreSQL will scale with partitioning.
- Elasticsearch clusters scale horizontally for search.
3. Schema:
Videos
```sql
CREATE TABLE videos (
    video_id UUID PRIMARY KEY,
    uploader_id UUID REFERENCES users(user_id),
    title TEXT,
    description TEXT,
    category VARCHAR(50),
    visibility VARCHAR(20), -- public/private/unlisted
    upload_time TIMESTAMP,
    duration INT,
    status VARCHAR(20) -- processing, ready, failed
);
```
Tags
```sql
CREATE TABLE video_tags (
    video_id UUID REFERENCES videos(video_id),
    tag TEXT
);
```
3. Comments & Likes
DB: NoSQL (MongoDB / DynamoDB)
1. Tradeoff:
- High volume of writes and reads.
- Comments and likes don’t need strong joins or ACID transactions.
- Document structure fits well (nested replies, metadata).
2. Scalability:
- Easily scalable horizontally.
- DynamoDB with partition key = video_id spreads traffic across videos, though hot videos may still need key sharding (see failure scenarios below).
3. Schema:
Comments Document (MongoDB-style)
```json
{
  "_id": "comment_id",
  "video_id": "vid123",
  "user_id": "user456",
  "text": "Nice video!",
  "created_at": "2024-03-30T12:00:00Z",
  "replies": [
    { "user_id": "user789", "text": "Agreed!", "created_at": "..." }
  ]
}
```
Likes Table (for fast toggle)
- DynamoDB table with composite key `(video_id, user_id)`
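A minimal sketch of the toggle against such a table, assuming boto3 and a hypothetical table named `video_likes`; the conditional write is what makes repeated taps idempotent:

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("video_likes")  # hypothetical table name

def like(video_id: str, user_id: str) -> bool:
    """Record a like; returns False if the user had already liked the video."""
    try:
        table.put_item(
            Item={"video_id": video_id, "user_id": user_id},
            # Conditional write: fails instead of overwriting an existing like.
            ConditionExpression="attribute_not_exists(user_id)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise

def unlike(video_id: str, user_id: str) -> None:
    table.delete_item(Key={"video_id": video_id, "user_id": user_id})
```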
4. Watch History & Resume Playback
DB: Wide Column Store (Apache Cassandra / ScyllaDB)
1. Tradeoff:
- Append-heavy workload.
- Needs fast writes and predictable reads (per-user query).
- Wide column DBs are perfect for this time-series-like pattern.
2. Scalability:
- Proven to scale to billions of rows.
- Netflix uses Cassandra for exactly this use case.
3. Schema:
WatchHistory Table
```cql
CREATE TABLE watch_history (
    user_id UUID,
    video_id UUID,
    last_watched TIMESTAMP,
    position_seconds INT,
    -- video_id must be part of the key, otherwise two videos watched at
    -- the same instant collide on (user_id, last_watched)
    PRIMARY KEY ((user_id), last_watched, video_id)
) WITH CLUSTERING ORDER BY (last_watched DESC, video_id ASC);

-- Resume playback needs a direct lookup by (user_id, video_id), which the
-- table above cannot serve; a small companion table covers it:
CREATE TABLE resume_positions (
    user_id UUID,
    video_id UUID,
    position_seconds INT,
    last_watched TIMESTAMP,
    PRIMARY KEY ((user_id), video_id)
);
```
5. Subscriptions & Billing (Netflix)
DB: Relational DB (PostgreSQL)
1. Tradeoff:
- Financial data must be ACID-compliant.
- Supports relationships between plans, users, and transactions.
2. Scalability:
- Can be scaled using sharding per user ID and/or using distributed SQL solutions later.
3. Schema:
Subscriptions
```sql
CREATE TABLE subscriptions (
    subscription_id UUID PRIMARY KEY,
    user_id UUID REFERENCES users(user_id),
    plan_type VARCHAR(50),
    status VARCHAR(20),
    start_date DATE,
    end_date DATE
);
```
Payments
```sql
CREATE TABLE payments (
    payment_id UUID PRIMARY KEY,
    subscription_id UUID REFERENCES subscriptions(subscription_id),
    amount DECIMAL,
    status VARCHAR(20),
    transaction_date TIMESTAMP
);
```
6. Search & Recommendations
DB: Elasticsearch + Graph DB (optional)
1. Tradeoff:
- Elasticsearch for keyword/tag search.
- Graph DB (like Neo4j or AWS Neptune) useful for complex recommendation logic (users who watched X also watched Y).
2. Scalability:
- Elasticsearch horizontally scales via shards.
- For large graphs, use Graph DB + precomputed offline recommendation pipelines.
7. CDN & Video Files
Storage: Object Store (Amazon S3, Google Cloud Storage)
Edge Delivery: CDN (Cloudflare, Akamai, AWS CloudFront)
1. Tradeoff:
- Video content is best stored as blobs, not in a traditional DB.
- Object stores are highly durable (99.999999999%) and cost-effective.
- CDNs offload traffic and reduce latency.
2. Scalability:
- Practically infinite scaling for video files.
- CDN edge nodes can handle tens of Tbps.
8. Analytics, Logs, and Metrics
DB: Data Lake (S3 + Presto) + Time Series DB (Prometheus / InfluxDB)
1. Tradeoff:
- Raw logs go into object store, queried via Presto or Spark.
- Real-time metrics (CPU, memory, traffic) go into Prometheus.
- Useful for dashboards, alerts, usage patterns.
High-level design
1. API Gateway
Functionality:
- Acts as the entry point for all client requests (web, mobile, TV).
- Handles authentication, rate limiting, logging, and request routing.
- Directs requests to the appropriate backend service.
2. User Service
Functionality:
- Manages user registration, login, logout.
- Handles user profiles, preferences, and account settings.
- Issues and validates JWT tokens or OAuth tokens for secure access.
3. Video Upload/Ingestion Service
Functionality:
- Accepts uploaded videos from users or studios.
- Validates file type, size, and formats.
- Stores raw video temporarily in a staging bucket.
- Triggers the video encoding pipeline.
4. Video Encoding Service
Functionality:
- Transcodes videos into multiple resolutions and formats (e.g., 144p to 4K).
- Generates adaptive bitrate streams (HLS, MPEG-DASH).
- Stores processed videos in object storage (e.g., S3).
- Updates video metadata with streamable URLs and status.
5. Metadata Service
Functionality:
- Stores and serves video metadata (title, description, tags, duration).
- Provides APIs to list/search/filter videos.
- Synchronizes with Elasticsearch for full-text search support.
6. Playback Service
Functionality:
- Handles requests for video playback.
- Returns signed URLs or CDN paths for adaptive streaming.
- Stores playback progress per user (for resume support).
7. Comment & Like Service
Functionality:
- Manages likes, dislikes, and comments per video.
- Provides APIs for creating, retrieving, updating, and moderating comments.
- Supports nested replies and like/dislike counts.
8. Watch History Service
Functionality:
- Tracks which videos users have watched and when.
- Supports "Continue Watching" and personalized home page.
- Stores watch progress and timestamps.
9. Search & Recommendation Service
Functionality:
- Provides keyword-based search using Elasticsearch.
- Generates personalized recommendations using collaborative filtering or ML models.
- Ranks trending videos, new releases, and user-specific suggestions.
10. Subscription & Billing Service (Netflix-specific)
Functionality:
- Manages plans, subscriptions, and payment status.
- Integrates with payment gateways (Stripe, Razorpay, etc.)
- Handles access control to premium content based on active subscription.
11. Notification Service
Functionality:
- Sends in-app, email, or push notifications for new uploads, subscriptions, or recommendations.
- Integrates with messaging systems like Kafka or RabbitMQ.
12. Admin Service
Functionality:
- Allows internal users to manage users, videos, comments, and abuse reports.
- Approve or reject videos (if needed).
- View analytics dashboards.
13. Content Delivery Network (CDN)
Functionality:
- Caches and delivers video content from edge servers closer to users.
- Reduces latency and bandwidth cost.
- Supports high concurrency during peak traffic.
14. Analytics & Logging Pipeline
Functionality:
- Collects logs for video views, playback errors, buffering, engagement, etc.
- Streams data into a data lake (e.g., S3 + Presto).
- Supports dashboards, insights, ML training pipelines.
15. Database Cluster(s)
Functionality:
- Stores structured data (PostgreSQL for metadata, users, subscriptions).
- Stores unstructured or large data (MongoDB for comments, Cassandra for watch history).
- Distributed, replicated, and partitioned for scale.
16. Object Storage
Functionality:
- Stores original and encoded video files, thumbnails, and subtitle files.
- Provides durability and availability guarantees.
- Works with CDN for content delivery.
Request flows
1. User Signup/Login Flow
Scenario: User signs up or logs in
- Client (Web/Mobile) sends `POST /users/register` or `POST /users/login` to API Gateway.
- API Gateway forwards the request to User Service.
- User Service verifies credentials (or stores new user), hashes passwords, and stores info in User DB (PostgreSQL).
- On success, it generates a JWT access token and sends it back via API Gateway.
2. Video Upload Flow (YouTube-style)
Scenario: Creator uploads a video
- Client sends a `POST /videos/upload` request to API Gateway with the video file and metadata.
- API Gateway routes to Video Upload Service.
- Upload Service saves raw video to staging storage bucket (S3/Blob Storage).
- It then sends a message to a queue (Kafka/SQS) to trigger Video Encoding Service.
- Encoding Service reads video from staging, processes it into various resolutions, and stores output in object storage.
- On success, it updates Metadata Service with the encoded video URLs and playback readiness.
- Metadata Service stores metadata in PostgreSQL and indexes it in Elasticsearch.
3. Video Playback Flow
Scenario: User clicks “Play” on a video
- Client sends `GET /videos/{videoId}/play` to API Gateway.
- API Gateway calls Playback Service.
- Playback Service verifies user access (subscription status if Netflix), then retrieves the stream manifest URL (HLS/DASH) from Metadata DB.
- Returns signed URL or CDN path to client.
- Client player streams video directly from CDN using adaptive bitrate.
4. Search & Browse Flow
Scenario: User searches for a video
- Client sends `GET /search?q=query` to API Gateway.
- API Gateway routes to Search Service.
- Search Service queries Elasticsearch index for matching videos.
- Results (with partial metadata) are returned to client.
- For full video details (e.g., views, uploader name), frontend makes follow-up calls to Metadata Service.
5. Like / Comment Flow
Scenario: User likes or comments on a video
- Client sends `POST /videos/{videoId}/like` or `POST /videos/{videoId}/comments`.
- API Gateway routes to Like/Comment Service.
- The service writes to MongoDB / DynamoDB and optionally updates like/comment counters in Metadata DB.
- Counters can be asynchronously updated to avoid write contention.
6. Watch History Flow
Scenario: User watches a video partially
- Client periodically sends `POST /users/{userId}/history/{videoId}` with the current playback time.
- API Gateway routes to Watch History Service.
- The service writes a record to Cassandra (user ID + timestamp + position).
- When the user returns, a `GET` request to the same service fetches the last playback time to resume.
7. Recommendation Flow
Scenario: User opens homepage
- Client sends `GET /videos/recommendations` to API Gateway.
- Gateway calls Recommendation Service.
- Service queries pre-computed results from:
- Cache (Redis)
- Graph DB (optional)
- or from ML model inference via internal APIs
- Response includes a curated list of videos with basic metadata.
- Client fetches detailed metadata by calling Metadata Service.
8. Subscription & Payment Flow (Netflix)
Scenario: User subscribes to a plan
- Client sends `POST /subscribe` to API Gateway.
- API Gateway forwards to Subscription Service.
- Subscription Service invokes Payment Gateway API (e.g., Stripe).
- On success, subscription is recorded in PostgreSQL, and access is updated in Access Control DB.
- Future playback requests will validate this subscription before serving content.
9. CDN & Video Delivery Flow
Scenario: Content is streamed from nearest edge
- After retrieving the streaming manifest from Playback Service, the video is played from CDN edge node.
- If video is not cached, CDN pulls from origin (object store like S3) and then caches it.
10. Admin Moderation Flow
Scenario: Admin removes flagged video
- Admin sends `PUT /admin/videos/{videoId}/remove` via an internal dashboard.
- Admin Service verifies access and updates the video status in Metadata DB.
- If needed, encoding artifacts can be deleted from storage.
- Changes are propagated to cache and search index.
Detailed component design
1. API Gateway
1. How it handles requirements
Functional:
- Routes requests to appropriate microservices via routing tables.
- Handles authentication (JWT/OAuth token parsing and validation).
- Implements rate limiting using token buckets or leaky bucket algorithms.
- Supports protocol transformations (e.g., HTTP/2 to HTTP/1.1).
Non-functional:
- Scalability: Horizontally scalable stateless service.
- Security: Enforces HTTPS, CORS policies, input validation.
- Observability: Logs all requests, integrates with tracing systems like OpenTelemetry.
2. Interactions
- Forwards requests to downstream services via REST/gRPC.
- Reads service discovery info via Consul/Eureka or API configs via Kong/Apigee/Nginx.
3. Algorithms / Data Structures
- Routing via prefix trees or hash maps (path → service)
- Token Bucket for rate limiting
- LRU Caches for JWT token introspection and service discovery TTLs
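A minimal in-process token bucket illustrating the rate-limiting point above; this is a sketch only, since a production gateway would typically keep the counters in Redis or in the proxy layer itself:

```python
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate=100, capacity=200)   # 100 req/s, bursts up to 200
if not bucket.allow():
    print("429 Too Many Requests")
```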
2. User Service
1. How it handles requirements
Functional:
- Manages account creation, login, logout, profile settings.
- Authenticates users and issues JWT tokens.
- Supports email verification and password hashing (using bcrypt/scrypt).
Non-functional:
- Security: Stores passwords securely (bcrypt), uses HTTPS, and CSRF tokens for forms.
- Availability: Deploys in multi-region, uses database replicas.
- Performance: Caches frequently accessed user profiles using Redis.
2. Interactions
- Communicates with PostgreSQL to read/write user info.
- Sends messages to Notification Service for email/SMS.
- Interacts with Auth service to validate tokens (if decoupled).
3. Algorithms / Data Structures
- Hash maps in Redis for user profile cache
- Bloom filters to check username/email uniqueness before DB hit
- Secure password hashing algorithms (bcrypt, Argon2)
3. Video Upload/Ingestion Service
1. How it handles requirements
Functional:
- Accepts video files via multipart uploads or resumable uploads (Tus protocol).
- Validates file formats and stores in staging object storage.
- Publishes a job to a message queue (Kafka/SQS) for encoding.
Non-functional:
- Scalability: Upload service is stateless and horizontally scalable.
- Fault Tolerance: Resumable upload protocols ensure incomplete files aren’t lost.
- Durability: Relies on S3/Blob Storage for persistent raw file storage.
2. Interactions
- Writes to object storage (e.g., S3).
- Publishes to Kafka → `video_ingest_topic` for the encoding pipeline.
- Updates Metadata Service via REST/gRPC once the video is ready.
3. Algorithms / Data Structures
- Chunked upload buffering (Ring buffer / Sliding window)
- MD5 hash checks to verify upload integrity
- Retry queues with exponential backoff for failure handling
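A sketch of the integrity check during chunk assembly, assuming the client sends an MD5 of the full file up front (names are illustrative, not a specific resumable-upload protocol):

```python
import hashlib
from typing import Iterable

def assemble_and_verify(chunks: Iterable[bytes], expected_md5: str) -> bytes:
    """Concatenate uploaded chunks and verify the whole-file checksum."""
    digest = hashlib.md5()
    out = bytearray()
    for chunk in chunks:
        digest.update(chunk)
        out.extend(chunk)
    if digest.hexdigest() != expected_md5:
        # Corrupted upload: reject so the client can retry the bad range.
        raise ValueError("checksum mismatch")
    return bytes(out)
```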
4. Video Encoding Service
1. How it handles requirements
Functional:
- Consumes video files, runs FFmpeg to generate HLS/DASH formats.
- Generates thumbnails and captions (auto-transcription using speech-to-text).
- Updates Metadata DB with encoding status.
Non-functional:
- Performance: Parallel encoding pipelines using containerized workers (K8s + FFmpeg).
- Scalability: Each worker pod can process a queue of tasks independently.
- Reliability: Encodes idempotently; failed jobs are retried.
2. Interactions
- Reads from object storage.
- Publishes status to Kafka topic → `video_encoded_topic`.
- Calls Metadata Service and stores processed file URLs.
3. Algorithms / Data Structures
- Priority queues for encoding jobs (based on popularity or VIP)
- FFmpeg-based codecs and bitrate ladder generation
- Scene change detection algorithms to pick thumbnails
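A sketch of a bitrate-ladder HLS encode driven through FFmpeg from a worker; the ladder values here are assumptions, and real ladders are tuned per title:

```python
import subprocess

LADDER = [  # (output height, video bitrate) -- illustrative values
    (240, "400k"), (480, "1200k"), (720, "3000k"), (1080, "5000k"),
]

def encode_hls(src: str) -> None:
    for height, bitrate in LADDER:
        subprocess.run([
            "ffmpeg", "-y", "-i", src,
            "-vf", f"scale=-2:{height}",       # keep aspect ratio
            "-c:v", "libx264", "-b:v", bitrate,
            "-c:a", "aac", "-b:a", "128k",
            "-hls_time", "6",                  # 6-second segments
            "-hls_playlist_type", "vod",
            f"out_{height}p.m3u8",
        ], check=True)
```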
5. Metadata Service
1. How it handles requirements
Functional:
- Stores title, description, tags, upload status, visibility.
- Supports update, fetch, and list APIs for metadata.
- Synchronizes searchable data with Elasticsearch.
Non-functional:
- Performance: Heavily read-optimized with caching layers (Redis).
- Availability: Deployed with read replicas and load-balanced.
- Consistency: Uses DB transactions to ensure accurate metadata state.
2. Interactions
- PostgreSQL or MySQL for persistent metadata.
- Elasticsearch for search indexing.
- Talks to Comment Service, Like Service, and Recommendation Service for composite views.
3. Algorithms / Data Structures
- Inverted index in Elasticsearch for tag/title search.
- Redis caching with TTL and LRU for top video metadata.
- Background sync pipelines for reindexing and bulk update jobs.
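The Redis caching described above is the classic cache-aside read. A minimal sketch, with `load_from_db` as a stand-in stub for the real PostgreSQL query:

```python
import json
import redis

r = redis.Redis()
TTL_SECONDS = 60

def load_from_db(video_id: str) -> dict:
    # Stub standing in for the real PostgreSQL lookup.
    return {"video_id": video_id, "title": "stub"}

def get_video_metadata(video_id: str) -> dict:
    key = f"video:meta:{video_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)            # cache hit
    meta = load_from_db(video_id)            # cache miss: go to the DB
    r.set(key, json.dumps(meta), ex=TTL_SECONDS)
    return meta
```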
6. Playback Service
1. How it handles requirements
Functional:
- Validates access to video.
- Returns the manifest URL for playback (e.g., `.m3u8` for HLS).
- Logs watch-start events.
Non-functional:
- Security: Generates signed expirable URLs.
- Low Latency: Returns CDN path quickly via pre-signed URL or token.
- Scalability: Stateless, cache pre-generated manifest paths in Redis.
2. Interactions
- Metadata Service for playback readiness.
- CDN and Object Storage for streaming URLs.
- Watch History Service to record progress.
3. Algorithms / Data Structures
- Access token generation using HMAC/SHA256
- Time-based token invalidation (e.g., `exp` claim in JWT or a signed query param)
- Ring buffer or message queue for live video buffering (if supporting live streams)
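A minimal sketch of the HMAC-signed, expiring URL described above; key handling and the exact URL layout are assumptions:

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"rotate-me"  # would come from a secrets manager in practice

def sign_url(path: str, ttl_seconds: int = 300) -> str:
    expires = int(time.time()) + ttl_seconds
    msg = f"{path}?expires={expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{path}?{urlencode({'expires': expires, 'sig': sig})}"

def verify(path: str, expires: int, sig: str) -> bool:
    if time.time() > expires:
        return False                           # token expired
    msg = f"{path}?expires={expires}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)  # constant-time compare

print(sign_url("/videos/vid123/master.m3u8"))
```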
7. Comment & Like Service
1. How it handles requirements
Functional:
- Allows users to add/edit/delete comments.
- Tracks likes and dislikes.
- Supports threading and replies.
Non-functional:
- High Throughput: NoSQL (MongoDB/DynamoDB) ensures horizontal write scalability.
- Consistency: Eventual consistency is acceptable.
- Caching: Top N comments cached in Redis.
2. Interactions
- MongoDB/DynamoDB for storage.
- Metadata Service to update like counts asynchronously.
- Notification Service for mention/tag alerts.
3. Algorithms / Data Structures
- Nested documents for threaded comments
- Counters using Redis INCR or DynamoDB atomic counters
- Pagination with cursor-based scrolling for performance
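A sketch of cursor-based comment pagination with pymongo, assuming the document shape from the database section; `_id` doubles as the cursor since ObjectIds sort roughly by creation time:

```python
from pymongo import DESCENDING, MongoClient

comments = MongoClient()["videoapp"]["comments"]  # hypothetical names

def page_comments(video_id: str, cursor=None, limit: int = 20):
    query = {"video_id": video_id}
    if cursor is not None:
        query["_id"] = {"$lt": cursor}   # resume strictly after last seen
    batch = list(
        comments.find(query).sort("_id", DESCENDING).limit(limit)
    )
    next_cursor = batch[-1]["_id"] if len(batch) == limit else None
    return batch, next_cursor
```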
8. Watch History Service
1. How it handles requirements
Functional:
- Records which video user watched, at what time, and how much.
- Supports resume functionality.
Non-functional:
- High Write Volume: Uses Cassandra for write-optimized time-series-like data.
- Availability: Highly replicated.
- Low Latency Reads: Indexed by user ID with clustering by timestamp.
2. Interactions
- Receives POSTs from clients during playback.
- Called by Playback Service to resume.
- Feeds Recommendation Service with data for personalization.
3. Algorithms / Data Structures
- Wide column structure: rows per user, columns per video
- Bloom filters to skip reads on missing entries
- TTL-based data expiry to remove old records
9. Recommendation Service
1. How it handles requirements
Functional:
- Personalizes homepage and suggestions.
- Uses collaborative filtering, content-based filtering, or deep learning models.
Non-functional:
- Scalability: Pre-computed results are cached (e.g., Redis or Materialized Views).
- Latency: Serve results in <100ms from cache or inference engine.
2. Interactions
- Metadata Service for video info.
- Watch History, Like, Comment Services as input signals.
- ML pipelines fetch data from Data Lake for model training.
3. Algorithms / Data Structures
- Matrix factorization (ALS), KNN, Word2Vec for embeddings
- Graph traversal (for “people who watched this also watched”)
- Real-time ranking using features: recency, CTR, completion rate
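As a toy version of the graph-traversal idea ("people who watched this also watched"), an item-to-item co-occurrence count over watch sessions; production systems would use ALS or learned embeddings instead, and the sessions here are made up:

```python
from collections import Counter, defaultdict
from itertools import combinations

watch_sessions = [           # hypothetical per-user watch sets
    {"v1", "v2", "v3"},
    {"v1", "v3"},
    {"v2", "v3", "v4"},
]

# Count how often each pair of videos appears in the same session.
co_counts: dict[str, Counter] = defaultdict(Counter)
for session in watch_sessions:
    for a, b in combinations(sorted(session), 2):
        co_counts[a][b] += 1
        co_counts[b][a] += 1

def also_watched(video_id: str, k: int = 3) -> list[str]:
    return [v for v, _ in co_counts[video_id].most_common(k)]

print(also_watched("v1"))    # ['v3', 'v2']
```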
10. Subscription & Billing Service
1. How it handles requirements
Functional:
- Manages plans, trial status, and billing cycles.
- Integrates with Stripe/PayPal etc.
Non-functional:
- Strong consistency for financial transactions.
- Idempotent APIs to avoid duplicate billing.
2. Interactions
- PostgreSQL for subscriptions/payments.
- Calls external payment gateway APIs.
- Sets access control flags used by Playback Service.
3. Algorithms / Data Structures
- Event-sourcing or outbox pattern for reliable payment state changes
- Cron-based batch jobs for monthly billing
- Time-series tables for usage metering
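A sketch of the idempotency idea: every retry of a charge carries the same key, so a duplicate attempt replays the stored result instead of billing twice. The in-memory dict stands in for a persistent outbox table, and the gateway call is a placeholder, not a real SDK:

```python
_processed: dict[str, dict] = {}   # stand-in for a persistent outbox table

def charge(subscription_id: str, amount_cents: int, idempotency_key: str) -> dict:
    if idempotency_key in _processed:
        return _processed[idempotency_key]   # duplicate: replay stored result
    # A real implementation would call the payment gateway here and persist
    # the outcome in the same transaction as the outbox row.
    result = {"status": "succeeded", "amount": amount_cents}
    _processed[idempotency_key] = result
    return result

key = "sub_123:2024-03"            # one key per subscription per billing cycle
charge("sub_123", 999, key)
charge("sub_123", 999, key)        # retried call: no double charge
```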
Trade offs/Tech choices
1. Microservices Architecture
✅ Chosen: Microservices
- Why: Enables modularity, independent scaling, tech heterogeneity, better fault isolation.
- Trade-off: Increased operational complexity (network calls, service discovery, distributed tracing, eventual consistency).
❌ Alternative: Monolith
- Easier to build and deploy initially, but would not scale with billions of requests and services with independent load patterns (e.g., comment service vs. encoding service).
2. PostgreSQL / Relational DB for Core Metadata & User Info
✅ Chosen: PostgreSQL for structured data
- Why: ACID compliance, rich query language, integrity constraints, easy indexing for video metadata and user relations.
- Trade-off: Sharding and replication are harder than NoSQL, but manageable with tools like Citus, Vitess, or CockroachDB.
❌ Alternative: NoSQL (e.g., MongoDB)
- Would compromise on ACID guarantees for user/account data.
- Relational schema is more appropriate for subscriptions and foreign-key-heavy data.
3. NoSQL (MongoDB / DynamoDB) for Comments & Likes
✅ Chosen: MongoDB/DynamoDB
- Why: Schema-less, high write throughput, easy to scale horizontally. Perfect for comment documents and atomic counters for likes.
- Trade-off: Joins are difficult. Cross-entity consistency is not guaranteed.
❌ Alternative: Relational DB
- Would struggle with performance under massive comment volume (e.g., 100B+ rows).
- Writing deeply nested replies or bulk likes would be inefficient.
4. Wide Column DB (Cassandra) for Watch History
✅ Chosen: Cassandra
- Why: Write-optimized, high availability, linear horizontal scaling. Perfect for time-series style workloads.
- Trade-off: Consistency is tunable but not strong by default. Requires careful schema design (denormalization, partition keys).
❌ Alternative: Relational DB
- Would choke under the write volume from billions of watch records/day.
- Indexing on timestamp per user would become a bottleneck.
5. Object Storage (S3/GCS) + CDN for Video Delivery
✅ Chosen: Object Store + CDN
- Why: Cost-effective, durable (11 9s), infinite scalability. CDNs improve latency and bandwidth efficiency.
- Trade-off: Cannot directly stream from DB. Requires chunked and signed URL delivery with cache invalidation strategies.
❌ Alternative: Storing videos in file systems or databases
- Impractical. Increases cost, complexity, and reduces throughput compared to object storage.
6. Elasticsearch for Search
✅ Chosen: Elasticsearch
- Why: Powerful full-text search engine with tokenization, relevance ranking, and filtering. Scales horizontally.
- Trade-off: Requires sync mechanism with metadata DB. Writes are expensive and not ACID.
❌ Alternative: SQL full-text search
- Not scalable. Slower and less relevant ranking at large scale.
7. Redis for Caching Hot Data
✅ Chosen: Redis
- Why: Sub-millisecond latency for read-heavy operations (e.g., hot videos, top comments, playback positions).
- Trade-off: Data in Redis is volatile and must be backed by persistent stores.
❌ Alternative: Memcached
- No persistence, limited data structure support compared to Redis.
8. Kafka for Asynchronous Processing
✅ Chosen: Kafka
- Why: Distributed, fault-tolerant pub/sub system. Ideal for encoding jobs, analytics logging, notifications.
- Trade-off: Requires ops effort to manage. Ordering within partitions but not across.
❌ Alternative: Direct HTTP sync calls or RabbitMQ
- RabbitMQ is good for traditional queue semantics but doesn’t scale as well for logs or broadcast to many consumers.
9. FFmpeg for Encoding
✅ Chosen: FFmpeg (wrapped in worker containers)
- Why: Open-source, highly customizable for video encoding/transcoding workflows.
- Trade-off: Needs wrapper scripts and scaling via orchestration (e.g., Kubernetes jobs or batch processing).
❌ Alternative: Cloud-based encoding services
- Easier to integrate but expensive and less flexible for large-scale systems like YouTube.
10. Token-based Auth (JWT)
✅ Chosen: JWT
- Why: Stateless authentication for horizontal scaling of all stateless services.
- Trade-off: Revocation is tricky. Needs short TTLs and refresh tokens.
❌ Alternative: Session-based auth
- Centralized session store becomes a bottleneck in globally distributed systems.
Failure scenarios/bottlenecks & Mitigations
🔧 1. API Gateway Failures
Potential Failures:
- Gateway crash / instance down
- Overload / DDoS
- Misconfigured routing rules
- Token validation slowness
Bottlenecks:
- TLS handshake overhead under massive traffic
- JWT validation becoming CPU-bound
Mitigations:
- Deploy behind a load balancer with auto-scaling
- Use a fast in-memory JWT validation cache (Redis)
- Implement global rate limiting (leaky bucket/token bucket)
- Use WAF to filter abusive IPs
📦 2. Upload / Ingestion Failures
Potential Failures:
- Large file uploads time out
- Client loses connection midway
- File corruption on upload
- Queue or worker crash before ingestion
Bottlenecks:
- File validation becoming slow for massive uploads
- Sequential processing of uploads limiting throughput
Mitigations:
- Use resumable chunked uploads (e.g., TUS protocol)
- Validate file type/length on client before upload
- Store uploads in durable object store immediately
- Use a persistent queue (Kafka) to decouple processing
- Ensure idempotent workers (retry-safe ingestion)
🎞 3. Video Encoding Failures
Potential Failures:
- Worker node crash during encoding
- Out-of-memory errors during 4K processing
- Encoding backlog from too many uploads
- Incorrect format/codec handling
Bottlenecks:
- FFmpeg CPU-bound under high resolution
- IO bottlenecks reading large files from storage
Mitigations:
- Use containerized workers (Kubernetes Jobs) with memory limits
- Auto-scale encoding worker pools
- Retry failed jobs with backoff
- Pre-check input format before encoding
- Use job priority queue for trending/high-priority content
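A small retry helper with exponential backoff and full jitter, of the kind the encoding workers above would wrap around a job attempt; the parameters are illustrative:

```python
import random
import time

def retry(fn, attempts: int = 5, base: float = 1.0, cap: float = 60.0):
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise                       # out of retries: surface the error
            delay = min(cap, base * 2 ** attempt)
            time.sleep(random.uniform(0, delay))   # full jitter
```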
📂 4. Metadata Service Failures
Potential Failures:
- DB outage or corruption
- High read/write contention
- Search index out of sync
- Metadata cache inconsistency
Bottlenecks:
- PostgreSQL row-level locking on heavy concurrent writes
- Slow joins across video, tags, and uploader tables
Mitigations:
- Read replicas for scale-out
- Use connection poolers (e.g., PgBouncer)
- Separate read and write models (CQRS pattern)
- Async indexing pipeline for Elasticsearch
- Cache top video metadata in Redis with short TTLs
🎥 5. Playback Flow Failures
Potential Failures:
- Invalid or expired playback token
- CDN cache miss or region latency
- Wrong manifest returned (bad encoding metadata)
Bottlenecks:
- Redis token cache becomes single point of failure
- CDN origin throttle when too many videos are cold
Mitigations:
- Generate signed URLs with short expiry for security
- Pre-warm CDN cache for popular content
- Use multi-CDN fallback (e.g., CloudFront + Akamai)
- Store video metadata with redundancy
💬 6. Comments & Likes Failures
Potential Failures:
- Hot video leads to write hotspot
- User spams like/dislike rapidly
- Inconsistent comment count due to async processing
Bottlenecks:
- MongoDB primary becoming write bound
- DynamoDB partition throttling due to uneven traffic
Mitigations:
- Use sharding or partition keys (e.g., video_id%N)
- Apply write throttling / debounce logic on client side
- Eventually update comment counters asynchronously
- Limit reply depth or flatten very deep threads
👁 7. Watch History Failures
Potential Failures:
- Inconsistent playback resume point
- High write throughput causes Cassandra slowdown
- Old history not cleaned up → storage bloat
Bottlenecks:
- Time-series writes becoming IO-bound
- Querying large partitions for binge watchers
Mitigations:
- TTL to expire history beyond N days
- Use clustering order DESC for most recent entries
- Write throttling from clients (batch resume sync every 30s)
- Use bounded partitions (e.g., per month per user)
🔍 8. Search & Recommendation Failures
Potential Failures:
- Elasticsearch indexing lag
- Missing videos in search due to partial index failure
- Stale recommendations
Bottlenecks:
- Elasticsearch full cluster GC pauses
- ML model prediction latency at inference time
Mitigations:
- Use bulk async indexers with retry and checkpointing
- Monitor indexing lag and alert on discrepancy
- Cache home page recommendations in Redis per user
- Use ML inference caching for most frequent queries
💳 9. Subscription & Billing Failures
Potential Failures:
- Payment gateway timeout or failure
- Incorrect subscription tier validation
- Double charges
Bottlenecks:
- High write volume at billing cycle (e.g., 1st of the month)
- High latency if integrated via sync payment APIs
Mitigations:
- Use idempotent billing APIs with transaction IDs
- Async webhooks from Stripe → write to outbox/event log
- Flag users as `pending_payment` until confirmation
- Run billing in scheduled batches
📡 10. CDN and Video Delivery Failures
Potential Failures:
- CDN cache miss = slow load
- Origin fetch throttling
- Geo-restriction misconfigured
Bottlenecks:
- Cold start latency on un-cached videos
- CDN edge saturation in certain regions
Mitigations:
- Pre-warm top videos before release (scheduled cache prefill)
- Serve fallback bitrate while higher resolution loads
- Use signed CDN URLs with geo-policy enforcement
📈 11. Logging, Monitoring, and Observability Failures
Potential Failures:
- Missing logs due to buffer loss
- Slow dashboard queries (e.g., Prometheus / Grafana)
- Unnoticed system-wide slowness
Bottlenecks:
- Too many metrics → cardinality explosion
- Unbounded log ingestion → storage cost + delays
Mitigations:
- Use structured logging + log levels per service
- Push logs through Kafka + FluentD to a central store (e.g., S3, Loki)
- Pre-aggregate metrics and alerts (e.g., p99 latency)
- Use circuit breakers + alerting (e.g., via Prometheus + AlertManager)
🔄 General Cross-Service Bottlenecks
Scenarios:
- N+1 service calls
- Global mutex behaviour (e.g., updating trending counts)
- Thundering herd problem (cache miss for trending video)
Mitigations:
- Use batch APIs or data loaders
- Asynchronous eventual counters for analytics
- Employ request coalescing or lock-striping for high-contention updates
- Use cache stampede protection (e.g., request deduplication with singleflight pattern)
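A minimal sketch of the singleflight pattern from the last bullet: concurrent cache misses for the same key share one backend call instead of stampeding the origin. Error propagation to followers and result expiry are deliberately elided:

```python
import threading

class SingleFlight:
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight: dict[str, threading.Event] = {}
        self._results: dict[str, object] = {}

    def do(self, key: str, fn):
        with self._lock:
            event = self._inflight.get(key)
            if event is None:
                event = threading.Event()      # we become the leader
                self._inflight[key] = event
                leader = True
            else:
                leader = False                 # someone is already fetching
        if leader:
            try:
                self._results[key] = fn()      # the single backend call
            finally:
                with self._lock:
                    del self._inflight[key]
                event.set()                    # wake all waiting followers
            return self._results[key]
        event.wait()
        return self._results.get(key)          # reuse the leader's result

flight = SingleFlight()
value = flight.do("video:trending", lambda: "expensive DB read")
```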