My Solution for Design Facebook Messenger with Score: 9/10
by dewdrop_pinnacle261
System requirements
Functional Requirements
- Send/receive instant messages (1:1 and group chats)
- Read receipts, delivery receipts
- User online/offline presence
- Push notifications
- User blocking and privacy controls
- Media sharing (images, video, audio)
- End-to-end encryption (E2EE)
- Message reactions
- Message search and history
Non-Functional Requirements
- High availability (99.99%)
- Low latency (<200ms per message)
- Scalability to 1B+ users
- Global deployment with multi-region support
- Strong consistency where required (friendship, blocking), eventual consistency for messages
- Secure authentication and authorization
- Observability and monitoring
- Graceful degradation under partial failure
Capacity estimation
- Daily active users (DAU): 1B
- Peak concurrent users: ~100M
- Avg messages/user/day: 50 → ~50B messages/day → ~580K/sec average, ~1.5M/sec at an assumed 2-3× peak
- Message size: ~1 KB → ~600MB/sec average ingest
- Storage for 30 days → 50B * 1KB * 30 ≈ 1.5PB
- Presence updates: ~10M/sec (with batching)
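These figures follow from simple arithmetic; a quick sanity-check script (the 2-3× peak factor is the only assumption beyond the numbers above):

```python
# Back-of-envelope check of the capacity estimates above.
DAU = 1_000_000_000
MSGS_PER_USER_PER_DAY = 50
MSG_SIZE_BYTES = 1024
RETENTION_DAYS = 30
PEAK_FACTOR = 2.5            # assumption: peak traffic is 2-3x the daily average

msgs_per_day = DAU * MSGS_PER_USER_PER_DAY            # 50B messages/day
avg_per_sec = msgs_per_day / 86_400                   # ~580K msg/sec average
peak_per_sec = avg_per_sec * PEAK_FACTOR              # ~1.5M msg/sec at peak
ingest_per_sec = avg_per_sec * MSG_SIZE_BYTES         # ~600 MB/sec average
hot_storage = msgs_per_day * MSG_SIZE_BYTES * RETENTION_DAYS  # ~1.5 PB

print(f"{avg_per_sec:,.0f} msg/s avg, {peak_per_sec:,.0f} msg/s peak")
print(f"{ingest_per_sec / 1e6:.0f} MB/s ingest, {hot_storage / 1e15:.2f} PB hot storage")
```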
API design
POST /messages/send
GET /messages/history?conversation_id={id}
POST /messages/read
POST /users/block
POST /users/unblock
GET /users/presence
POST /reactions/add
POST /reactions/remove
Auth: OAuth2, JWT
Rate limiting and idempotency keys on writes (example below)
WebSocket APIs:
connect
sendMessage
receiveMessage
updatePresence
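To make the write contract concrete, a minimal client sketch using requests (the host, header names, and JSON fields are illustrative; only the endpoint and the auth/idempotency requirements come from the lists above):

```python
import uuid
import requests  # illustrative REST client sketch; field names are assumptions

resp = requests.post(
    "https://api.example.com/messages/send",   # placeholder host
    headers={
        "Authorization": "Bearer <jwt>",        # OAuth2/JWT as noted above
        "Idempotency-Key": str(uuid.uuid4()),   # generated once, reused on retry
    },
    json={
        "conversation_id": "c_123",
        "body": "hello",
        "client_ts": 1700000000000,             # client send time, for ordering hints
    },
    timeout=5,
)
resp.raise_for_status()  # 2xx means the message was durably accepted
```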
Database design
SQL (PostgreSQL, MySQL) → users, friendships, block lists
NoSQL (Cassandra, DynamoDB) → messages, reactions, presence
Redis → online presence, unread counters, ephemeral data
Blob storage (S3, GCS) → media files
ER Model core tables:
- User (user_id, name, profile, settings)
- Conversation (conversation_id, participants, type)
- Message (message_id, conversation_id, sender_id, body, ts, status)
- Reaction (reaction_id, message_id, user_id, type)
- BlockList (blocker_id, blocked_id)
- Presence (user_id, online_status, last_active)
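As one possible mapping of the Message table onto Cassandra (a sketch using the DataStax Python driver; host, keyspace, and exact column types are assumptions):

```python
from cassandra.cluster import Cluster  # DataStax Python driver; host/keyspace are placeholders

session = Cluster(["cassandra-host"]).connect("messenger")

# Partition by conversation_id so one chat's history lives together;
# a TIMEUUID message_id doubles as the ts column and gives newest-first reads.
session.execute("""
    CREATE TABLE IF NOT EXISTS messages (
        conversation_id uuid,
        message_id      timeuuid,
        sender_id       uuid,
        body            text,
        status          text,
        PRIMARY KEY ((conversation_id), message_id)
    ) WITH CLUSTERING ORDER BY (message_id DESC)
""")
```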
High-level design
API Gateway
Authentication Service
User Service
Conversation Service
Message Service
Notification Service
Presence Service
Media Service (file storage, CDN)
WebSocket Gateway / Message Broker
Database Cluster (SQL for metadata, NoSQL for message storage)
Cache Layer (Redis, Memcached)
Search Engine (Elasticsearch for message search)
Request flows
Send message:
- Client → API Gateway → Authentication.
- → Message Service → Store in DB.
- → Push to Message Broker → Deliver to recipient(s).
- → Update read/delivery status.
- → Notify Notification Service.
Receive message:
- WebSocket pushes message to client.
- Client updates UI.
- On read, client → API → Message Service → Update DB.
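A sketch of that final read step, assuming kafka-python and an illustrative read-receipts topic (the DB update itself is elided):

```python
import json
from kafka import KafkaProducer  # kafka-python; topic/field names are assumptions

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)

def mark_read(conversation_id: str, reader_id: str, up_to_message_id: str) -> None:
    """Persist the read marker, then fan a receipt event out to participants."""
    # 1. Update the reader's per-conversation read pointer in the message store
    #    (e.g. an UPDATE on a read-markers table; elided here).
    # 2. Publish a receipt event; the WebSocket gateway consumes it and pushes
    #    "read" indicators to the other participants' open connections.
    producer.send("read-receipts", value={
        "conversation_id": conversation_id,
        "reader_id": reader_id,
        "up_to_message_id": up_to_message_id,
    })
```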
Detailed component design
Message Service
- Uses Kafka/Pulsar to decouple the write path from delivery
- Stores messages in Cassandra (sharded by conversation_id)
- Uses idempotency keys to deduplicate
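A sketch of the dedup step with redis-py (key prefix and TTL are assumptions): SET NX claims the idempotency key atomically, so only the first delivery of a retried request performs the write.

```python
import redis  # redis-py; key naming and TTL below are assumptions

r = redis.Redis(host="redis-host")

def store_once(idempotency_key: str, write_message) -> bool:
    """Write the message at most once per idempotency key.

    SET NX succeeds only for the first attempt; retries with the same key
    are dropped. The 24h TTL bounds memory while covering realistic retries.
    """
    if r.set(f"idem:{idempotency_key}", 1, nx=True, ex=86_400):
        write_message()  # e.g. the Cassandra INSERT into messages
        return True
    return False  # duplicate delivery; safe to ack without rewriting
```

A production version would also handle the write failing after the key is claimed, e.g. by deleting the key on failure so the client's retry can succeed.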
Presence Service
- Stores presence in Redis (TTL ~1 min)
- Uses Redis Pub/Sub for updates
- Fallback to “last seen” in SQL
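A minimal sketch of that TTL-plus-pub/sub pattern with redis-py (key and channel names are illustrative):

```python
import redis  # redis-py; key/channel names are illustrative

r = redis.Redis(host="redis-host")

def heartbeat(user_id: str) -> None:
    """Called on each client heartbeat (e.g. every ~30s over the WebSocket)."""
    # TTL slightly above the heartbeat interval: if heartbeats stop,
    # the key expires and the user reads as offline within ~1 minute.
    r.set(f"presence:{user_id}", "online", ex=60)
    r.publish("presence-updates", user_id)  # subscribers refresh their view

def is_online(user_id: str) -> bool:
    return r.exists(f"presence:{user_id}") == 1
```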
WebSocket Gateway
- Horizontally scalable
- Sticky sessions or token-based reconnect
- Session state stored in Redis or shared store
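A toy single-node gateway using the websockets package (the frame shape and verify_token are hypothetical; as noted above, a real deployment keeps the user-to-gateway mapping in a shared store):

```python
import asyncio
import json
import websockets  # the 'websockets' package; frame shapes below are assumptions

CONNECTIONS: dict = {}  # user_id -> open socket, local to this gateway node

def verify_token(token: str) -> str:
    """Hypothetical JWT/reconnect-token check; returns the user_id it encodes."""
    return token.split(":", 1)[1]  # stand-in only

async def handler(ws):
    hello = json.loads(await ws.recv())   # first frame carries the token
    user_id = verify_token(hello["token"])
    CONNECTIONS[user_id] = ws             # production: also map user -> gateway host in Redis
    try:
        async for frame in ws:            # inbound sendMessage frames
            pass                          # forward to the Message Service / Kafka (elided)
    finally:
        CONNECTIONS.pop(user_id, None)    # presence TTL marks the user offline soon after

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await asyncio.Future()            # run until cancelled

asyncio.run(main())
```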
Trade-offs/tech choices
NoSQL over SQL for messages: High write throughput, horizontal scale
SQL for user data: Strong relations, ACID
Redis for presence: Low latency, fast expiration
WebSockets over polling: Lower latency, less overhead
Kafka decoupling: Buffers traffic spikes so downstream consumers and storage aren't overwhelmed
Failure scenarios/bottlenecks
- WebSocket gateway crash → reconnect using tokens
- DB partition loss → use quorum reads/writes, retry
- Kafka overload → apply backpressure, shed non-critical traffic
- Redis failover → use Redis Sentinel or clustered Redis
- Network partitions → apply retries, buffer on client, reconcile on reconnect
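Several of these mitigations reduce to retrying idempotent operations; a minimal backoff sketch (the attempt count and base delay are arbitrary):

```python
import random
import time

def with_backoff(op, attempts: int = 5, base: float = 0.1):
    """Retry an idempotent operation with exponential backoff plus jitter.

    Safe here because writes carry idempotency keys, so a retried send
    cannot create a duplicate message.
    """
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries; surface the failure to the caller
            time.sleep(base * (2 ** attempt) * random.uniform(0.5, 1.5))
```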
Future improvements
Video/audio calls → use TURN/STUN servers, optimize for UDP
ML for spam/abuse detection
Dynamic scaling → auto-scale WebSocket and message layers
Cross-region replication → use CRDTs or conflict-free strategies
Optimize Redis operations → fine-tune pub/sub channels, shard presence data