My Solution for Design Facebook Messenger with Score: 9/10

by dewdrop_pinnacle261

System requirements


Functional Requirements

  • Send/receive instant messages (1:1 and group chats)
  • Read receipts, delivery receipts
  • User online/offline presence
  • Push notifications
  • User blocking and privacy controls
  • Media sharing (images, video, audio)
  • End-to-end encryption (E2EE)
  • Message reactions
  • Message search and history

Non-Functional Requirements

  • High availability (99.99%)
  • Low latency (<200ms end-to-end message delivery)
  • Scalability to 1B+ users
  • Global deployment with multi-region support
  • Strong consistency where required (friendship, blocking), eventual consistency for messages
  • Secure authentication and authorization
  • Observability and monitoring
  • Graceful degradation under partial failure




Capacity estimation

  • Daily active users (DAU): 1B
  • Peak concurrent users: ~100M
  • Avg messages/user/day: 50 → ~50B messages/day → ~580K/sec on average, higher at peak
  • Message size: ~1 KB → ~600MB/sec of message traffic
  • Storage for 30 days → 50B * 1KB * 30 ≈ 1.5PB
  • Presence updates: ~10M/sec (with batching)
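Back-of-envelope math behind these numbers (plain Python as a calculator):

```python
# Quick sanity check on the capacity estimates above.
dau = 1_000_000_000                 # daily active users
msgs_per_user = 50                  # average messages per user per day
msgs_per_day = dau * msgs_per_user  # 5e10 = 50B messages/day

avg_msgs_per_sec = msgs_per_day / 86_400        # ~580K/sec sustained
msg_size_bytes = 1_024                          # ~1 KB per message
bandwidth = avg_msgs_per_sec * msg_size_bytes   # ~600 MB/sec

storage_30d = msgs_per_day * msg_size_bytes * 30  # ~1.5 PB for 30 days
print(f"{avg_msgs_per_sec:,.0f} msg/s, {bandwidth/1e6:,.0f} MB/s, {storage_30d/1e15:.2f} PB")
```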






API design

POST /messages/send

GET /messages/history?conversation_id={id}

POST /messages/read

POST /users/block

POST /users/unblock

GET /users/presence?user_id={id}

POST /reactions/add

POST /reactions/remove


Auth: OAuth2, JWT

Per-user rate limiting; idempotency keys on sends so client retries don't create duplicate messages (sketched below)
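A sketch of what an idempotent send looks like from the client. The `Idempotency-Key` header name and the endpoint shape are my assumptions, not a fixed spec:

```python
# Client retries reuse the same key, so the server can deduplicate.
import uuid
import requests

resp = requests.post(
    "https://api.example.com/messages/send",
    headers={
        "Authorization": "Bearer <jwt>",
        "Idempotency-Key": str(uuid.uuid4()),  # same key on retry → no duplicate
    },
    json={"conversation_id": "c123", "body": "hello"},
    timeout=5,
)
resp.raise_for_status()
```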

WebSocket APIs:

  • connect
  • sendMessage
  • receiveMessage
  • updatePresence
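A minimal client sketch for these WebSocket events, using Python's `websockets` library; the frame schema (`{"type": ...}`) is an assumed convention, not a fixed protocol:

```python
import asyncio
import json
import websockets

async def main():
    # connect: token-authenticated WebSocket session
    async with websockets.connect("wss://chat.example.com/ws?token=<jwt>") as ws:
        # sendMessage frame
        await ws.send(json.dumps({
            "type": "sendMessage",
            "conversation_id": "c123",
            "body": "hello",
        }))
        # receiveMessage / updatePresence frames arrive on the same socket
        async for raw in ws:
            frame = json.loads(raw)
            print(frame["type"], frame)

asyncio.run(main())
```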





Database design

SQL (PostgreSQL, MySQL) → users, friendships, block lists

NoSQL (Cassandra, DynamoDB) → messages, reactions, presence

Redis → online presence, unread counters, ephemeral data

Blob storage (S3, GCS) → media files


ER Model core tables:

  • User (user_id, name, profile, settings)
  • Conversation (conversation_id, participants, type)
  • Message (message_id, conversation_id, sender_id, body, ts, status)
  • Reaction (reaction_id, message_id, user_id, type)
  • BlockList (blocker_id, blocked_id)
  • Presence (user_id, online_status, last_active)
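One possible Cassandra layout for the Message table, matching the sharding choice described later: partition by conversation_id so a conversation's history lives together, cluster by a time-based id for ordered reads. Keyspace and node names are placeholders:

```python
from cassandra.cluster import Cluster

session = Cluster(["cassandra-seed-1"]).connect("messenger")  # placeholder names
session.execute("""
    CREATE TABLE IF NOT EXISTS message (
        conversation_id uuid,
        message_id      timeuuid,   -- time-ordered, doubles as the timestamp
        sender_id       uuid,
        body            text,
        status          text,       -- sent / delivered / read
        PRIMARY KEY ((conversation_id), message_id)
    ) WITH CLUSTERING ORDER BY (message_id DESC)
""")
```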





High-level design

API Gateway

Authentication Service

User Service

Conversation Service

Message Service

Notification Service

Presence Service

Media Service (file storage, CDN)

WebSocket Gateway / Message Broker

Database Cluster (SQL for metadata, NoSQL for message storage)

Cache Layer (Redis, Memcached)

Search Engine (Elasticsearch for message search)





Request flows

Send message:

  1. Client → API Gateway → Authentication.
  2. → Message Service → validate (conversation membership, block lists) → store in DB.
  3. → Push to Message Broker → deliver to online recipient(s) via their WebSocket gateways (steps 2-3 are sketched after this list).
  4. → Recipient acks → update delivery/read status.
  5. → Offline recipients → Notification Service for push.
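A minimal sketch of the store-then-publish path (steps 2-3), assuming kafka-python and the Cassandra table above; topic and broker names are placeholders:

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092"],
    value_serializer=lambda v: json.dumps(v).encode(),
)

def send_message(session, msg):
    # 1. Durable write first, so the message survives a broker hiccup.
    session.execute(
        "INSERT INTO message (conversation_id, message_id, sender_id, body, status) "
        "VALUES (%s, now(), %s, %s, 'sent')",
        (msg["conversation_id"], msg["sender_id"], msg["body"]),
    )
    # 2. Async fan-out; keying by conversation keeps per-chat ordering.
    producer.send("messages", key=msg["conversation_id"].encode(), value=msg)
```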

Receive message:

  1. WebSocket pushes message to client.
  2. Client updates UI.
  3. On read, client → API → Message Service → Update DB.





Detailed component design

Message Service

  • Uses Kafka/Pulsar to decouple write path
  • Stores messages in Cassandra (sharded by conversation_id)
  • Uses idempotency keys to deduplicate
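A sketch of the dedup check, assuming idempotency keys are tracked in Redis with a ~24h TTL (the key scheme is illustrative):

```python
import redis

r = redis.Redis(host="redis-1")  # placeholder host

def is_duplicate(idempotency_key: str) -> bool:
    # SET NX only succeeds for the first writer; replays see an existing key.
    fresh = r.set(f"idem:{idempotency_key}", "1", nx=True, ex=24 * 3600)
    return not fresh
```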

Presence Service

  • Stores presence in Redis (TTL ~1 min)
  • Uses Redis Pub/Sub for updates
  • Fallback to “last seen” in SQL
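A presence heartbeat sketch under these assumptions: each heartbeat refreshes a ~60s TTL key, and updates fan out over Pub/Sub; key and channel names are illustrative:

```python
import json
import time
import redis

r = redis.Redis(host="redis-1")

def heartbeat(user_id: str):
    r.set(f"presence:{user_id}", "online", ex=60)   # TTL ~1 min; expiry = offline
    r.publish("presence-updates", json.dumps(       # fan out to interested gateways
        {"user_id": user_id, "status": "online", "ts": time.time()}
    ))

def status(user_id: str) -> str:
    return "online" if r.exists(f"presence:{user_id}") else "offline"
```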

WebSocket Gateway

  • Horizontally scalable
  • Sticky sessions or token-based reconnect
  • Session state stored in Redis or shared store
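A sketch of the shared session store, assuming a user → gateway mapping in Redis so any node can route a message to the right socket host after reconnect:

```python
import redis

r = redis.Redis(host="redis-1")

def register(user_id: str, gateway_id: str):
    # Refreshed on heartbeat; expires on its own if the gateway dies uncleanly.
    r.set(f"ws:route:{user_id}", gateway_id, ex=120)

def route(user_id: str):
    gw = r.get(f"ws:route:{user_id}")
    return gw.decode() if gw else None  # None → user offline, hand off to push
```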






Trade offs/Tech choices

NoSQL over SQL for messages: High write throughput, horizontal scale

SQL for user data: Strong relations, ACID

Redis for presence: Low latency, fast expiration

WebSockets over polling: Lower latency, less overhead

Kafka decoupling: Prevents producer overload during spikes





Failure scenarios/bottlenecks

  • WebSocket gateway crash → reconnect using tokens
  • DB partition loss → use quorum reads/writes, retry
  • Kafka overload → apply backpressure, shed non-critical traffic
  • Redis failover → use Redis Sentinel or clustered Redis
  • Network partitions → apply retries, buffer on client, reconcile on reconnect
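A client-side retry sketch for the network-partition case: exponential backoff with jitter, resending with the same idempotency key so the server drops duplicates. `send_fn` is a placeholder for the actual transport:

```python
import random
import time

def send_with_retry(send_fn, msg, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return send_fn(msg)  # same idempotency key on every attempt
        except ConnectionError:
            delay = min(2 ** attempt, 30) + random.random()  # backoff + jitter
            time.sleep(delay)
    raise TimeoutError("message buffered; reconcile on reconnect")
```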





Future improvements

Video/audio calls → use TURN/STUN servers, optimize for UDP

ML for spam/abuse detection

Dynamic scaling → auto-scale WebSocket and message layers

Cross-region replication → use CRDTs or other conflict-resolution strategies

Optimize Redis operations → fine-tune pub/sub channels, shard presence data