Codemia | Master System Design Interviews Through Active Practice

My Solution for Design a Video Conferencing system with Score: 8/10

by iridescent_luminous693

System requirements

Functional:

User Management:

Allow users to register, log in, and manage profiles.
Support role-based permissions (e.g., host, co-host, participant).

Meeting Scheduling:

Create, edit, and delete meetings.
Send invitations via email or in-app notifications.
Generate unique meeting IDs and secure links.

Joining Meetings:

Enable users to join meetings via web or mobile applications.
Support password-protected and waiting room features for security.

Audio/Video Communication:

Provide high-quality audio and video streaming.
Include options to mute/unmute audio and enable/disable video.

Content Sharing:

Allow screen sharing, file sharing, and whiteboarding during meetings.
Support annotation tools for collaborative editing.

Chat and Messaging:

Enable text chat within meetings (private and group chats).
Provide emoji reactions and file sharing via chat.

Recording and Playback:

Allow hosts to record meetings and save recordings to cloud storage.
Provide playback and download options for participants.

Breakout Rooms:

Allow splitting a meeting into smaller group discussions.
Enable hosts to manage and switch participants between rooms.

Meeting Controls:

Allow hosts to manage participants (e.g., remove, mute).
Provide polling, Q&A, and other engagement features.

Notifications and Reminders:

Send reminders for scheduled meetings.
Notify participants of changes to meeting details.

Analytics and Reporting:

Provide meeting insights (e.g., attendance, duration, engagement metrics).
Offer usage reports for admins.

Integration:

Support calendar integrations (e.g., Google Calendar, Outlook).
Provide APIs for third-party integration.

Non-Functional:

Performance:

Ensure low-latency audio/video streaming (<200ms).
Scale to support large meetings of up to 1,000 concurrent participants.

Reliability:

Ensure 99.99% uptime with robust failover mechanisms.
Handle transient network issues gracefully.

Scalability:

Support millions of active users and large-scale meetings with up to 1,000 participants.
Use auto-scaling for servers during high demand.

Security:

Use end-to-end encryption for meeting data.
Implement secure authentication and token-based access to meetings.
Protect recordings with role-based access controls.

Usability:

Provide intuitive user interfaces for all platforms (web, mobile, desktop).
Ensure accessibility for users with disabilities (e.g., screen reader support).

Compatibility:

Support cross-platform functionality on all major browsers and operating systems.
Provide seamless integration with popular tools like Slack, Teams, and CRMs.

Data Privacy:

Comply with GDPR, CCPA, and other regional privacy laws.
Allow users to control and export their data.

Fault Tolerance:

Use distributed servers to minimize the impact of hardware failures.
Implement retry mechanisms for dropped connections.

Maintainability:

Use modular design for ease of updates and feature additions.
Provide detailed logs and monitoring for issue diagnosis.

Cost Efficiency:

Optimize resource usage to minimize operational costs.
Use cloud services for scalable and flexible infrastructure.

Capacity estimation

1. User Metrics

Monthly Active Users (MAU): 100 million.
Daily Active Users (DAU): 10 million (~10% of MAU).
Concurrent Users:
- At peak, 1 million users could be on video calls simultaneously.
- Each meeting may have an average of 10 participants.

2. Meetings

Meetings per Day:
- Assuming each user joins 2 meetings daily, and counting each join as a separate meeting (a conservative upper bound): $10 \, \text{million DAU} \times 2 = 20 \, \text{million meetings/day}$
Concurrent Meetings:
- At peak, assuming 10% of daily meetings run at the same time: $20 \, \text{million meetings/day} \times 0.1 = 2 \, \text{million concurrent meetings}$

3. Audio/Video Streaming

Participants per Meeting:
- Average of 10 participants per meeting.
Stream Data:
- Each participant streams:
  - Video: 1 Mbps.
  - Audio: 100 Kbps.
- Total bandwidth per meeting: $10 \, \text{participants} \times (1 \, \text{Mbps video} + 100 \, \text{Kbps audio}) = 11 \, \text{Mbps}$
- Total bandwidth at peak: $2 \, \text{million meetings} \times 11 \, \text{Mbps} = 22 \, \text{Tbps}$

4. Storage

Recording:
- Assume 20% of meetings are recorded.
- Average meeting duration: 1 hour.
- Recording size (compressed):
  - Video: 500 MB/hour.
  - Audio: 50 MB/hour.
- Total recording storage per day: $20 \, \text{million meetings} \times 0.2 \, \text{recorded} \times (500 \, \text{MB video} + 50 \, \text{MB audio}) = 2.2 \, \text{PB/day}$
- Monthly storage requirement: $2.2 \, \text{PB/day} \times 30 = 66 \, \text{PB/month}$

5. Notifications

Meeting Notifications:
- Each user receives 5 notifications daily (e.g., invites, reminders): $10 \, \text{million DAU} \times 5 = 50 \, \text{million notifications/day}$
- Notifications per second (peak): $50 \, \text{million notifications} \times 0.05 \, \text{peak hour} \div 3600 \approx 694 \, \text{NPS (notifications per second)}$

6. Chat Messages

Messages per User:
- Assume each user sends 10 messages during a meeting.
- Total messages per day: $10 \, \text{million DAU} \times 10 = 100 \, \text{million messages/day}$
Peak Message Traffic:
- At peak, 10% of messages are sent in one hour: $100 \, \text{million messages/day} \times 0.1 \div 3600 \approx 2{,}778 \, \text{messages/second (MPS)}$

7. Search

Search Queries:
- 10% of users perform 2 searches daily: $10 \, \text{million DAU} \times 0.1 \, \text{searching users} \times 2 = 2 \, \text{million searches/day}$
Peak Search Traffic:
- At peak, 10% of searches are performed in one hour: $2 \, \text{million searches/day} \times 0.1 \div 3600 \approx 56 \, \text{queries/second}$

8. Analytics

Metrics Collected:
- Track meeting duration, user activity, and bandwidth usage.
- Assume 10 metrics per meeting: $20 \, \text{million meetings/day} \times 10 = 200 \, \text{million metrics/day}$
Real-Time Metrics:
- 1% of metrics are queried in real-time: $200 \, \text{million} \times 0.01 = 2 \, \text{million real-time metrics/day}$

9. Bandwidth Usage

Audio/Video Streaming:
- Total bandwidth (peak): $22 \, \text{Tbps (audio/video streaming)} + \text{additional for file sharing/screen sharing}$
Screen Sharing:
- Assume 10% of users share screens at 2 Mbps.
- Additional bandwidth: $100{,}000 \, \text{users} \times 2 \, \text{Mbps} = 200 \, \text{Gbps}$
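
The estimates in this section are simple enough to recompute in a few lines. The following is a minimal sketch that takes the stated assumptions as inputs; the variable and function names are illustrative, and decimal units are assumed (1 Tbps = 1e9 kbps, 1 PB = 1e9 MB).

```python
# Back-of-envelope capacity check using the figures stated in this section.
# Decimal units are assumed: 1 Tbps = 1e9 kbps, 1 PB = 1e9 MB.
DAU = 10_000_000
PEAK_CONCURRENT_USERS = 1_000_000
PARTICIPANTS_PER_MEETING = 10

meetings_per_day = DAU * 2                     # each user joins 2 meetings
concurrent_meetings = meetings_per_day // 10   # 10% of daily meetings at peak

# Per participant: 1,000 kbps video + 100 kbps audio.
kbps_per_meeting = PARTICIPANTS_PER_MEETING * (1_000 + 100)
peak_stream_tbps = concurrent_meetings * kbps_per_meeting / 1e9

# 20% of meetings recorded, 500 MB video + 50 MB audio per one-hour meeting.
recording_mb_per_day = meetings_per_day // 5 * (500 + 50)
recording_pb_per_day = recording_mb_per_day / 1e9
recording_pb_per_month = recording_mb_per_day * 30 / 1e9


def peak_rate_per_second(daily_total: int, peak_hour_share: float) -> float:
    """Average per-second rate during a peak hour carrying the given share."""
    return daily_total * peak_hour_share / 3600


notifications_per_second = peak_rate_per_second(DAU * 5, 0.05)
messages_per_second = peak_rate_per_second(DAU * 10, 0.10)
searches_per_second = peak_rate_per_second(DAU // 10 * 2, 0.10)

# 10% of concurrent users share a screen at 2 Mbps each.
screen_share_gbps = PEAK_CONCURRENT_USERS // 10 * 2 / 1_000

print(f"peak streaming:   {peak_stream_tbps:.0f} Tbps")
print(f"recordings:       {recording_pb_per_day:.1f} PB/day")
print(f"notifications:    {notifications_per_second:.0f}/s")
print(f"chat messages:    {messages_per_second:.0f}/s")
print(f"searches:         {searches_per_second:.0f}/s")
print(f"screen sharing:   {screen_share_gbps:.0f} Gbps")
```

Keeping the inputs as named constants makes it easy to see how sensitive each total is to a single assumption, such as the share of meetings that are recorded.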

API design

1. User Management APIs

POST /users/register
- Description: Register a new user with email, password, and profile details.
POST /users/login
- Description: Authenticate user credentials and issue a JWT for session management.
GET /users/profile
- Description: Retrieve the logged-in user’s profile details.
PUT /users/profile
- Description: Update the user’s profile information (e.g., name, profile picture).
POST /users/logout
- Description: Log out the user and invalidate the session.

2. Meeting Scheduling APIs

POST /meetings/schedule
- Description: Create a new meeting with details (e.g., time, participants).
PUT /meetings/{meetingId}/edit
- Description: Update meeting details (e.g., reschedule, change participants).
DELETE /meetings/{meetingId}
- Description: Cancel a scheduled meeting.
GET /meetings/upcoming
- Description: Retrieve a list of upcoming meetings for the user.
GET /meetings/{meetingId}/details
- Description: Get details of a specific meeting, including participants and agenda.

3. Joining and Managing Meetings

POST /meetings/{meetingId}/join
- Description: Join an ongoing meeting using a meeting ID or secure link.
POST /meetings/{meetingId}/leave
- Description: Leave an ongoing meeting.
POST /meetings/{meetingId}/control
- Description: Allow hosts to manage participants (e.g., mute, remove).
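
The control endpoint only makes sense if the caller's role is checked before the action is applied. The sketch below shows one way that check might look. The policy table is an assumption (the design only says hosts can mute and remove; giving co-hosts mute rights is a guess), and `handle_control` is a hypothetical handler, not a real framework API.

```python
from enum import Enum


class Role(Enum):
    HOST = "host"
    CO_HOST = "co-host"
    PARTICIPANT = "participant"


# Hypothetical policy: which roles may perform which control actions.
ALLOWED_ACTIONS = {
    Role.HOST: {"mute", "remove"},
    Role.CO_HOST: {"mute"},
    Role.PARTICIPANT: set(),
}


def authorize_control(actor_role: Role, action: str) -> bool:
    """Return True if the role is allowed to perform the action."""
    return action in ALLOWED_ACTIONS.get(actor_role, set())


def handle_control(actor_role: Role, action: str, target_user_id: str):
    """Return an (HTTP status, body) pair for a control request."""
    if not authorize_control(actor_role, action):
        return 403, {"error": "forbidden"}
    return 200, {"action": action, "target": target_user_id}


print(handle_control(Role.HOST, "remove", "user-42"))
print(handle_control(Role.PARTICIPANT, "mute", "user-42"))
```

In practice the role would be read from the participation record or the caller's token rather than passed in directly.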

4. Audio/Video Management APIs

POST /meetings/{meetingId}/audio/toggle
- Description: Mute or unmute the user’s microphone.
POST /meetings/{meetingId}/video/toggle
- Description: Enable or disable the user’s video stream.
POST /meetings/{meetingId}/screen-share/start
- Description: Start sharing the user’s screen in the meeting.
POST /meetings/{meetingId}/screen-share/stop
- Description: Stop screen sharing.

5. Chat and Collaboration APIs

POST /meetings/{meetingId}/chat/send
- Description: Send a message in the meeting chat.
GET /meetings/{meetingId}/chat/messages
- Description: Retrieve chat messages from a meeting.
POST /meetings/{meetingId}/whiteboard/create
- Description: Create a collaborative whiteboard for a meeting.
PUT /meetings/{meetingId}/whiteboard/update
- Description: Update the content of the whiteboard.

6. Recording and Playback APIs

POST /meetings/{meetingId}/record/start
- Description: Start recording the meeting.
POST /meetings/{meetingId}/record/stop
- Description: Stop recording the meeting.
GET /recordings/{recordingId}/download
- Description: Download a specific meeting recording.
GET /recordings/{recordingId}/playback
- Description: Stream a specific meeting recording.

7. Notification APIs

POST /notifications/subscribe
- Description: Subscribe to meeting reminders and updates.
DELETE /notifications/unsubscribe
- Description: Unsubscribe from meeting notifications.
GET /notifications/list
- Description: Retrieve a list of notifications for the user.

8. Analytics APIs

GET /analytics/meetings/{meetingId}
- Description: Retrieve metrics for a specific meeting (e.g., duration, participant count).
GET /analytics/usage
- Description: Retrieve user-level statistics (e.g., total meetings hosted, duration spent in meetings).

9. Search APIs

GET /search/users
- Description: Search for users by name or email.
GET /search/meetings
- Description: Search for meetings by title or date.

10. Integration APIs

POST /integrations/calendar/sync
- Description: Sync meetings with the user’s calendar (e.g., Google Calendar).
POST /integrations/third-party
- Description: Integrate with third-party tools (e.g., Slack, Teams).

Database design

1. User Database

Purpose:
- To store and manage user profiles, credentials, and related preferences.
Database Details:
- Table: users
  - user_id (UUID, Primary Key): Unique identifier for each user.
  - name (VARCHAR): Full name of the user.
  - email (VARCHAR, Unique): User’s email address.
  - password_hash (VARCHAR): Salted hash of the password (e.g., bcrypt); the plaintext password is never stored.
  - profile_picture_url (TEXT): URL for the user’s profile picture.
  - created_at (TIMESTAMP): Account creation timestamp.
Technology Used:
- PostgreSQL.
Reason:
- Relational data structure for managing profiles and ensuring data integrity.
- ACID compliance for secure and consistent user data storage.

2. Meeting Database

Purpose:
- To store meeting details, schedules, and participation records.
Database Details:
- Table: meetings
  - meeting_id (UUID, Primary Key): Unique identifier for each meeting.
  - title (VARCHAR): Meeting title or agenda.
  - start_time (TIMESTAMP): Scheduled start time.
  - end_time (TIMESTAMP): Scheduled end time.
  - meeting_link (VARCHAR): Unique meeting link.
  - host_id (UUID, Foreign Key): ID of the host user.
  - is_recorded (BOOLEAN): Indicates whether the meeting is recorded.
- Table: participation
  - participation_id (UUID, Primary Key): Unique participation record.
  - user_id (UUID, Foreign Key): User participating in the meeting.
  - meeting_id (UUID, Foreign Key): Meeting being attended.
  - role (ENUM): Role of the participant (e.g., host, co-host, participant).
Technology Used:
- PostgreSQL.
Reason:
- Relational database structure is ideal for managing linked data (e.g., meetings and participants).
- Supports complex queries like fetching meetings by participant or host.
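
To make the schema and the "meetings by participant" query concrete, here is a small sketch. It uses Python's built-in SQLite in place of PostgreSQL so that it runs without a server, which means UUIDs and timestamps are stored as text. The `users` table is trimmed to three columns, and the UNIQUE constraint and index on `participation` are assumptions added for illustration.

```python
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE users (
    user_id TEXT PRIMARY KEY,
    name    TEXT NOT NULL,
    email   TEXT NOT NULL UNIQUE
);
CREATE TABLE meetings (
    meeting_id   TEXT PRIMARY KEY,
    title        TEXT NOT NULL,
    start_time   TEXT NOT NULL,
    end_time     TEXT NOT NULL,
    meeting_link TEXT NOT NULL UNIQUE,
    host_id      TEXT NOT NULL REFERENCES users(user_id),
    is_recorded  INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE participation (
    participation_id TEXT PRIMARY KEY,
    user_id    TEXT NOT NULL REFERENCES users(user_id),
    meeting_id TEXT NOT NULL REFERENCES meetings(meeting_id),
    role       TEXT NOT NULL CHECK (role IN ('host', 'co-host', 'participant')),
    UNIQUE (user_id, meeting_id)
);
CREATE INDEX idx_participation_user ON participation(user_id);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(SCHEMA)
    return conn


def meetings_for_user(conn: sqlite3.Connection, user_id: str):
    """Return (title, role) for every meeting the user participates in."""
    rows = conn.execute(
        """
        SELECT m.title, p.role
        FROM participation AS p
        JOIN meetings AS m ON m.meeting_id = p.meeting_id
        WHERE p.user_id = ?
        ORDER BY m.start_time
        """,
        (user_id,),
    )
    return rows.fetchall()


conn = open_db()
alice, bob, meeting = (str(uuid.uuid4()) for _ in range(3))
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(alice, "Alice", "alice@example.com"), (bob, "Bob", "bob@example.com")],
)
conn.execute(
    "INSERT INTO meetings VALUES (?, ?, ?, ?, ?, ?, ?)",
    (meeting, "Design review", "2025-01-15T10:00:00Z", "2025-01-15T11:00:00Z",
     "https://example.com/j/abc123", alice, 0),
)
conn.executemany(
    "INSERT INTO participation VALUES (?, ?, ?, ?)",
    [(str(uuid.uuid4()), alice, meeting, "host"),
     (str(uuid.uuid4()), bob, meeting, "participant")],
)
print(meetings_for_user(conn, bob))
```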

3. Chat Database

Purpose:
- To store and manage chat messages within meetings.
Database Details:
- Table: chat_messages
  - message_id (UUID, Primary Key): Unique identifier for each message.
  - meeting_id (UUID, Foreign Key): Meeting in which the message was sent.
  - sender_id (UUID, Foreign Key): User who sent the message.
  - content (TEXT): The actual chat message.
  - sent_at (TIMESTAMP): Time the message was sent.
Technology Used:
- MongoDB.
Reason:
- NoSQL structure is optimized for high-velocity writes and unstructured chat data.
- Enables faster querying for chat histories during live meetings.

4. Recording Database

Purpose:
- To track and manage metadata of meeting recordings stored in cloud storage.
Database Details:
- Table: recordings
  - recording_id (UUID, Primary Key): Unique identifier for each recording.
  - meeting_id (UUID, Foreign Key): Meeting associated with the recording.
  - storage_url (TEXT): URL to the recording file in cloud storage.
  - created_at (TIMESTAMP): Recording creation timestamp.
  - format (ENUM): Recording format (e.g., MP4, MP3).
Technology Used:
- PostgreSQL.
Reason:
- Relational database ensures metadata integrity and seamless linkage to meeting records.
- Suitable for transactional queries like fetching recordings for specific meetings.

5. Notification Database

Purpose:
- To store and manage notifications sent to users about meetings and updates.
Database Details:
- Table: notifications
  - notification_id (UUID, Primary Key): Unique identifier for each notification.
  - user_id (UUID, Foreign Key): User receiving the notification.
  - meeting_id (UUID, Foreign Key): Associated meeting (if applicable).
  - message (TEXT): Notification content.
  - sent_at (TIMESTAMP): Timestamp when the notification was sent.
  - is_read (BOOLEAN): Indicates if the notification has been read.
Technology Used:
- Redis.
Reason:
- In-memory storage ensures low-latency access for real-time notifications.
- Built-in TTL allows automatic expiration of stale notifications.
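
The expiry behaviour described here can be illustrated without a Redis server. The class below is an in-memory stand-in, not Redis client code; its name and methods are hypothetical. Note that Redis applies a TTL per key, so a real implementation would likely need one key per notification or a sorted set scored by expiry time.

```python
import time


class ExpiringNotificationStore:
    """In-memory stand-in that drops notifications once their TTL has passed."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so expiry can be tested without sleeping
        self._items = {}      # user_id -> list of (expires_at, message)

    def push(self, user_id: str, message: str, ttl_seconds: float) -> None:
        expires_at = self._clock() + ttl_seconds
        self._items.setdefault(user_id, []).append((expires_at, message))

    def fetch(self, user_id: str) -> list:
        """Return unexpired messages in insertion order, pruning stale ones."""
        now = self._clock()
        live = [(exp, msg) for exp, msg in self._items.get(user_id, []) if exp > now]
        self._items[user_id] = live
        return [msg for _, msg in live]


fake_now = [0.0]
store = ExpiringNotificationStore(clock=lambda: fake_now[0])
store.push("u1", "Standup in 10 minutes", ttl_seconds=600)
store.push("u1", "Agenda updated", ttl_seconds=60)
fake_now[0] = 61.0
print(store.fetch("u1"))
```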

6. Analytics Database

Purpose:
- To track and process meeting and user engagement metrics.
Database Details:
- Table: meeting_metrics
  - metric_id (UUID, Primary Key): Unique identifier for each metric.
  - meeting_id (UUID, Foreign Key): Meeting associated with the metric.
  - metric_type (ENUM): Type of metric (e.g., duration, engagement).
  - value (FLOAT): Metric value.
  - timestamp (TIMESTAMP): Time when the metric was recorded.
- Table: user_metrics
  - metric_id (UUID, Primary Key): Unique identifier for each metric.
  - user_id (UUID, Foreign Key): User associated with the metric.
  - metric_type (ENUM): Type of metric (e.g., time spent, meetings attended).
  - value (FLOAT): Metric value.
  - timestamp (TIMESTAMP): Time when the metric was recorded.
Technology Used:
- DynamoDB for real-time metrics and PostgreSQL for historical data.
Reason:
- DynamoDB’s high write throughput supports real-time data ingestion.
- PostgreSQL ensures consistency and relational queries for historical analysis.

7. Media Storage

Purpose:
- To store multimedia files (e.g., recordings, shared files) associated with meetings.
Database Details:
- No traditional database table.
- Media files are stored in AWS S3.
- Metadata is tracked in the Recording Database.
Technology Used:
- AWS S3.
Reason:
- Scalable and cost-effective solution for storing large media files.
- Provides high availability and secure access through pre-signed URLs.

High-level design

1. Client Applications

Description: The user-facing interfaces through which participants interact with the system. These can be web, mobile, or desktop applications.
Responsibilities:
- Allow users to register, log in, schedule, join, and participate in meetings.
- Stream audio and video, share screens, and send chat messages.
- Display meeting notifications and playback recorded sessions.
Technologies: React, React Native, WebRTC, and other media streaming protocols.

2. API Gateway

Description: Acts as the entry point for all client requests. It routes requests to appropriate backend services.
Responsibilities:
- Ensures secure communication between clients and backend services.
- Manages routing, load balancing, rate-limiting, and authentication.
- Centralizes monitoring, logging, and error handling.
Technologies: AWS API Gateway, Nginx, or HAProxy.

3. Authentication Service

Description: Manages user authentication and session handling.
Responsibilities:
- Verifies user credentials and issues JWT tokens for session management.
- Stores user credentials securely and allows users to reset their passwords.
- Manages user roles (e.g., host, participant) and permissions.
Technologies: OAuth 2.0, JWT, bcrypt for password hashing, and Redis for session management.

4. Meeting Service

Description: Manages the creation, scheduling, and lifecycle of meetings.
Responsibilities:
- Allows users to schedule meetings, invite participants, and set meeting settings (e.g., password, waiting room).
- Manages meeting rooms and participant roles (host, co-host, participant).
- Handles meeting start, end, and real-time updates.
Technologies: Node.js, Express, PostgreSQL for meeting metadata, and Redis for real-time updates.

5. Chat Service

Description: Handles real-time text-based communication between meeting participants.
Responsibilities:
- Sends and receives chat messages in real-time.
- Allows private, group, and broadcast chats within meetings.
- Stores chat history and supports message search.
Technologies: WebSockets for real-time messaging, MongoDB for storing chat data, and Redis for caching frequently accessed messages.

6. Media Service

Description: Manages media files, including video streams, screen sharing, and recordings.
Responsibilities:
- Handles video/audio encoding/decoding, streaming, and recording.
- Uploads media files (recordings) to cloud storage (e.g., AWS S3).
- Manages media delivery to users with CloudFront for low-latency content delivery.
Technologies: WebRTC, AWS S3 for media storage, CloudFront CDN, FFmpeg for media processing.

7. Notification Service

Description: Sends notifications related to meetings and user activity.
Responsibilities:
- Sends real-time notifications (e.g., upcoming meetings, new messages).
- Handles push notifications for mobile or web users who are not actively connected.
- Manages email/SMS notifications for reminders and invites.
Technologies: Firebase Cloud Messaging (FCM), AWS SNS, Redis for caching notifications.

8. Analytics Service

Description: Tracks user and meeting data for real-time insights and reports.
Responsibilities:
- Collects and aggregates user engagement metrics (e.g., meeting attendance, duration, interactions).
- Provides historical reports and real-time analytics dashboards.
- Helps monitor system performance, meeting quality, and user behavior.
Technologies: Apache Kafka for event streaming, Apache Spark for batch processing, PostgreSQL and DynamoDB for storage.

9. Search Service

Description: Provides search capabilities for users to find meetings, messages, and users.
Responsibilities:
- Indexes meeting content (e.g., titles, descriptions) and chat messages for quick search.
- Allows searching through past meetings, messages, and users by various filters (e.g., keywords, date, or participants).
- Provides suggestions for users and meeting topics.
Technologies: Elasticsearch for indexing, Redis for caching search results.

10. Databases and Caching Layer

Databases:
- PostgreSQL: Stores structured data like user profiles, meeting details, and participation logs.
- MongoDB: Stores unstructured data such as chat messages and logs.
- DynamoDB: Stores real-time analytics data like active users and event streams.
- Elasticsearch: Indexes and allows full-text searching for users, messages, and meetings.
Caches:
- Redis: Caches frequently accessed data like user sessions, chat messages, notifications, and search results for fast retrieval.
CDN:
- CloudFront (AWS): Distributes media files and meeting recordings closer to users, reducing latency and bandwidth costs.

Request flows

User Login:

The client sends credentials (email/password) to the API Gateway.
The Authentication Service validates the credentials by querying the User Database (PostgreSQL) and returns a JWT token for session management.
The API Gateway responds with the token to the client.

Meeting Creation:

The client sends a meeting creation request (with meeting details) to the API Gateway.
The Meeting Service schedules the meeting, saves it to the Meeting Database (PostgreSQL), and returns the meeting details.
The API Gateway sends back the meeting link and confirmation to the client.

Joining a Meeting:

The client sends a request to join a meeting using the meeting ID and JWT token.
The API Gateway validates the token and fetches meeting details from the Meeting Database (PostgreSQL).
The Meeting Service responds with the meeting room and participant details, and the client joins.

Chat Messaging:

The client sends a chat message to the Chat Service.
The Chat Service stores the message in MongoDB and returns a message status.
The API Gateway responds to the client with a message confirmation.

Screen Sharing:

The client starts screen sharing, and the Media Service uploads the shared media to AWS S3.
The media URL is returned and shared with the client via the API Gateway.

Notifications:

The client requests notifications for meeting updates, and the Notification Service caches them in Redis.
The API Gateway confirms the notification status.

Search:

The client sends a search request for meetings, and the Search Service queries Elasticsearch for matching results.
The search results are returned to the client.

Analytics:

The client requests analytics data. The Analytics Service queries DynamoDB for real-time metrics and PostgreSQL for historical data.
The data is processed and returned to the client.

Logout:

The client sends a logout request to the API Gateway, which invalidates the JWT token with the Authentication Service.
The logout confirmation is sent to the client.

Detailed component design

1. Authentication Service

The Authentication Service ensures that only authorized users can access the video conferencing system. On receiving a login request with credentials (email and password), the service validates them against the User Database (PostgreSQL). Once verified, it generates a JSON Web Token (JWT) containing user-specific claims (such as user ID, role, and permissions), which the client presents on all subsequent requests to prove identity and manage access. The JWT is cryptographically signed so that tampering can be detected, and it carries an expiration time so that a session remains valid only for a limited duration. Because a signed JWT is encoded rather than encrypted by default, it should not carry secrets. The service also integrates with Redis, which stores active session data so that tokens can be looked up and validated quickly without querying the database repeatedly, improving performance and scalability.

For handling edge cases, the Authentication Service manages scenarios like expired or invalid tokens, ensuring that no unauthorized users can access restricted resources. When a token expires, the service forces the user to log in again to obtain a fresh token. If an invalid or tampered token is detected, the service immediately denies access, preventing unauthorized actions. Additionally, the service deals with multiple concurrent logins, ensuring secure session management across devices. If a user logs in from multiple devices, the service can manage separate sessions for each device or limit the number of concurrent logins, based on security policies. By ensuring token expiration and managing concurrent sessions, this service plays a pivotal role in maintaining system integrity and security.

2. Meeting Service

The Meeting Service is at the core of the system, responsible for scheduling, managing, and controlling meetings. When a user schedules a meeting, they provide essential information such as the title, start time, participants, and meeting settings (e.g., password protection, waiting rooms). This information is then stored in PostgreSQL, which provides a relational structure to manage the meeting's metadata efficiently. The service generates a unique meeting ID, which is used to create a secure link for participants to join. During the meeting lifecycle, the Meeting Service ensures smooth operation by handling features such as enabling/disabling participant audio or video, managing breakout rooms, and tracking participant activity. It also integrates with the Authentication Service to validate participant credentials before allowing entry into the meeting.

Edge cases are common in a dynamic meeting environment. One key issue is the potential for overlapping meetings, where two users attempt to schedule meetings at the same time, possibly leading to conflicts in room allocation or scheduling errors. The service ensures that meetings are scheduled correctly by checking for availability and suggesting alternative times if conflicts are detected. Additionally, there are instances where a meeting might be canceled or missed by participants. In such cases, the Meeting Service manages the meeting's status, updating participants and triggering automatic archiving or deletion of the meeting details. This ensures that the system remains organized and efficient, with canceled or no-show meetings cleaned up appropriately to save resources and storage.
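
The conflict check and alternative-time suggestion can be sketched as an interval-overlap test. The example assumes the check runs against one host's existing meetings, represented as `(start, end)` pairs. The helper names are hypothetical, and `secrets.token_urlsafe` is just one reasonable way to produce a hard-to-guess meeting ID.

```python
import secrets
from datetime import datetime, timedelta


def overlaps(start_a, end_a, start_b, end_b) -> bool:
    """Two half-open intervals [start, end) overlap if each starts before the other ends."""
    return start_a < end_b and start_b < end_a


def find_conflicts(existing, start, end):
    """Return the existing (start, end) meetings that overlap the requested slot."""
    return [m for m in existing if overlaps(m[0], m[1], start, end)]


def suggest_start(existing, start, duration):
    """Return the earliest start at or after `start` that avoids every existing meeting."""
    candidate = start
    for m_start, m_end in sorted(existing):
        if overlaps(m_start, m_end, candidate, candidate + duration):
            candidate = m_end  # push the slot past the conflicting meeting
    return candidate


def new_meeting_id() -> str:
    """Generate an unguessable, URL-safe meeting identifier."""
    return secrets.token_urlsafe(16)


day = datetime(2025, 1, 15)
existing = [
    (day.replace(hour=10), day.replace(hour=11)),
    (day.replace(hour=11), day.replace(hour=11, minute=30)),
]
requested = day.replace(hour=10, minute=30)
print(find_conflicts(existing, requested, requested + timedelta(minutes=30)))
print(suggest_start(existing, requested, timedelta(minutes=30)))
```

Treating the intervals as half-open means back-to-back meetings (one ending at 11:00, the next starting at 11:00) are not reported as conflicts.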

3. Chat Service

The Chat Service enables real-time communication between meeting participants. It supports both group chat and private messaging, ensuring that users can communicate with each other efficiently during a meeting. When a user sends a message, the service immediately stores it in MongoDB, a NoSQL database designed to handle high-throughput, unstructured data like chat logs. MongoDB allows the service to scale horizontally as the number of messages increases, while maintaining fast read and write operations. To optimize performance, the service integrates with Redis, a fast in-memory caching system, to store frequently accessed messages and reduce the load on the database, providing low-latency message retrieval for active participants. Redis also ensures that chat history is cached and can be quickly fetched, especially for ongoing meetings where chat activity is high.

In terms of edge cases, one challenge is message duplication, which can occur when a client retries after a network failure or a user accidentally submits the same message twice. The Chat Service addresses this with deduplication logic that detects and discards repeated submissions before they are stored in the database. Another issue is large messages or attachments, which might exceed MongoDB's 16 MB limit for a single document. To handle this, the service stores media files in AWS S3 and uses GridFS for oversized messages, so that messages with media or extensive text do not overwhelm the database. These strategies keep the chat functionality fast, reliable, and scalable.
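
One common way to implement deduplication is to have the client attach an idempotency key to each message and drop any key already seen. The sketch below assumes such a key, named `client_msg_id` here (a hypothetical field that is not part of the schema above), and keeps recent keys in a bounded in-memory map. A shared store, for example Redis `SET` with the `NX` and `EX` options, would be needed once several Chat Service instances are involved.

```python
from collections import OrderedDict


class MessageDeduplicator:
    """Remembers recently seen message keys and rejects repeats."""

    def __init__(self, max_entries: int = 10_000):
        self._seen = OrderedDict()
        self._max_entries = max_entries

    def is_new(self, meeting_id: str, sender_id: str, client_msg_id: str) -> bool:
        key = (meeting_id, sender_id, client_msg_id)
        if key in self._seen:
            self._seen.move_to_end(key)      # keep hot keys from being evicted
            return False
        self._seen[key] = True
        if len(self._seen) > self._max_entries:
            self._seen.popitem(last=False)   # evict the least recently seen key
        return True


dedup = MessageDeduplicator()
print(dedup.is_new("meeting-1", "alice", "msg-001"))  # first delivery
print(dedup.is_new("meeting-1", "alice", "msg-001"))  # retry of the same message
```

Keying on a client-supplied ID rather than on message text avoids dropping legitimate repeats, such as a user sending "ok" twice.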

4. Media Service

The Media Service is responsible for managing all media-related operations, including real-time audio/video streaming, screen sharing, and meeting recordings. The service handles video streams from participants using WebRTC for low-latency, peer-to-peer communication. It ensures that audio and video data is transmitted securely and efficiently to all participants in the meeting. Additionally, the Media Service handles the recording of meetings, storing the media in AWS S3. Media files such as video recordings, screen shares, or audio files are uploaded using multipart uploads, which split the media into smaller chunks and so make uploads more reliable, especially for large files. These media files are then cached in the CloudFront CDN for faster retrieval by users across different geographical locations, improving the overall user experience.

Handling large media files can lead to edge cases such as network instability and large file uploads. In cases of network instability, where a user’s connection is intermittent or drops, WebRTC ensures that the video and audio streams are adaptive, adjusting the bitrate based on network conditions to maintain a stable connection. Additionally, large video recordings can pose challenges when trying to upload or stream them, especially with limited bandwidth. To address this, the Media Service compresses videos before storage and uses AWS S3’s multipart upload feature, which breaks the file into smaller parts, allowing for more efficient and reliable uploads. By employing these techniques, the service ensures that media content is delivered reliably, even in challenging conditions.
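
To illustrate how a recording is split for a multipart upload, the sketch below plans the parts for a file of a given size. It performs no network calls, and `plan_parts` is a hypothetical helper. The two limits it respects are S3's documented ones: every part except the last must be at least 5 MiB, and an upload may have at most 10,000 parts.

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB   # S3 minimum for every part except the last
MAX_PARTS = 10_000        # S3 limit on the number of parts per upload


def plan_parts(total_size: int, part_size: int = 8 * MIB):
    """Return a list of (part_number, offset, length) covering the whole file."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the 5 MiB minimum")
    # Grow the part size if the file would otherwise need more than MAX_PARTS parts.
    part_size = max(part_size, -(-total_size // MAX_PARTS))
    parts = []
    offset = 0
    while offset < total_size:
        length = min(part_size, total_size - offset)
        parts.append((len(parts) + 1, offset, length))
        offset += length
    return parts


for number, offset, length in plan_parts(20 * MIB):
    print(f"part {number}: offset={offset} length={length}")
```

Because each part is uploaded independently, a dropped connection only requires retrying the affected part rather than the whole recording.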

5. Notification Service

The Notification Service is responsible for keeping users informed by sending them timely updates regarding meetings, messages, and system events. It sends notifications for scheduled meetings, reminders, new messages, and other important updates. It integrates with Firebase Cloud Messaging (FCM) for push notifications to mobile and web clients and uses AWS SNS to send SMS and email notifications. The service caches notifications in Redis, allowing quick access to frequently viewed notifications. This caching ensures that notifications can be sent without querying the database for every message or alert. The service is designed to send notifications in real-time, as well as deliver periodic updates like daily summaries or meeting reminders.

Handling edge cases in notifications involves scenarios like missed notifications (when users are offline or disconnected) and overwhelming notification volume (which can flood users with excessive alerts). To address missed notifications, the system queues notifications and sends them once the user is back online, ensuring they don’t miss important updates. For overwhelming volumes, the Notification Service employs throttling mechanisms to limit the number of notifications sent per user in a given time period, prioritizing high-urgency alerts (e.g., meeting invitations) over non-urgent ones. Additionally, caching in Redis helps prevent unnecessary duplicate notifications and ensures that only unique, relevant notifications are delivered.
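
The throttling rule can be sketched as a per-user sliding window in which urgent notifications always pass. The class name, the limits, and the choice to let urgent alerts bypass the cap while still counting toward it are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class NotificationThrottle:
    """Allows at most `max_per_window` non-urgent notifications per user per window."""

    def __init__(self, max_per_window: int, window_seconds: float):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self._sent = defaultdict(deque)   # user_id -> timestamps of sent notifications

    def allow(self, user_id: str, now: float, urgent: bool = False) -> bool:
        sent = self._sent[user_id]
        # Forget sends that have fallen out of the window.
        while sent and sent[0] <= now - self.window_seconds:
            sent.popleft()
        if urgent or len(sent) < self.max_per_window:
            sent.append(now)
            return True
        return False


throttle = NotificationThrottle(max_per_window=2, window_seconds=60)
print(throttle.allow("u1", now=0))                # within the limit
print(throttle.allow("u1", now=1))                # within the limit
print(throttle.allow("u1", now=2))                # over the limit
print(throttle.allow("u1", now=3, urgent=True))   # urgent alerts bypass the cap
```

Suppressed notifications could be dropped or folded into a later summary; the sketch only makes the allow/deny decision.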

6. Analytics Service

The Analytics Service is responsible for gathering and processing real-time and historical data about user behavior, meeting activity, and system performance. It tracks key metrics such as meeting duration, participant engagement, and feature usage. This data is essential for providing insights into how the system is being used and identifying areas for improvement. The service uses Apache Kafka for event streaming, ensuring that events like meeting starts, participant joins, and messages sent are captured in real-time. These events are processed by Apache Spark, which aggregates data for reporting purposes. The results are stored in DynamoDB for fast, real-time access and in PostgreSQL for long-term storage of historical data.

Edge cases for analytics include scenarios like delayed data processing, where high volumes of real-time data might be temporarily delayed before being aggregated, and data skew, where some meetings or users generate disproportionate amounts of data. To address these issues, the service uses streaming data pipelines to handle high-throughput data and batch processing for periodic reporting. The system also implements sharding and partitioning to distribute data efficiently across multiple servers or clusters, ensuring that data ingestion remains fast and scalable. These techniques ensure that the Analytics Service can handle high data volumes without impacting performance, providing valuable insights in real time.
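One common way to handle the data skew mentioned above is to "salt" the partition key for known hot meetings so their events spread across several partitions. The sketch below is an assumption about how this could be done, not a description of the actual pipeline: `partition_for`, `hot_meetings`, and `salt_buckets` are hypothetical names, and MD5 is used only as a stable stand-in hash (Kafka's own default partitioner uses a different hash function).

```python
import hashlib


def _stable_hash(text: str) -> int:
    # Stable across processes, unlike Python's built-in hash() for str.
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:8], "big")


def partition_for(meeting_id: str, event_id: str, num_partitions: int,
                  hot_meetings: frozenset = frozenset(),
                  salt_buckets: int = 8) -> int:
    """Pick a partition for an analytics event, salting hot meeting keys."""
    key = meeting_id
    if meeting_id in hot_meetings:
        # Spread a hot meeting over up to `salt_buckets` sub-keys so that
        # no single partition receives all of its events.
        salt = _stable_hash(event_id) % salt_buckets
        key = f"{meeting_id}#{salt}"
    return _stable_hash(key) % num_partitions
```

The cost of salting is that per-meeting aggregates for a hot meeting must be combined across its sub-keys in a second aggregation step.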

Trade-offs/Tech choices

Choice of PostgreSQL vs MongoDB:
- Trade-off: Chose PostgreSQL for structured data like user information and meeting details, as it provides strong consistency and relational integrity. For unstructured chat data, MongoDB was chosen for its flexibility and ability to handle high-velocity, unstructured data.
- Reason: PostgreSQL is ideal for managing relational data and ensuring transactional consistency, while MongoDB is better for rapidly growing, schema-less data like chat logs.
Real-time Notifications vs Batch Processing:
- Trade-off: Chose Redis for caching notifications and ensuring low-latency delivery of real-time alerts, while using AWS SNS to fan out bulk notifications over email and SMS.
- Reason: Redis offers quick access for real-time alerts, while a managed publish/subscribe service like SNS handles mass delivery to many recipients more efficiently.
WebRTC for Media vs Traditional Streaming:
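- Trade-off: WebRTC was chosen for real-time video and audio communication instead of traditional server-based streaming.
- Reason: WebRTC reduces server load and latency by enabling direct peer-to-peer communication for small calls. A full peer-to-peer mesh does not scale to large meetings, because each participant must send a separate stream to every other participant, so larger meetings would typically route WebRTC media through a media server such as an SFU (Selective Forwarding Unit).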
Caching vs Database Queries:
- Trade-off: Caching data in Redis (e.g., chat messages, notifications) versus querying the database for each request.
- Reason: Caching improves performance and reduces the load on databases, but requires careful cache invalidation to avoid serving outdated data.
Elasticsearch for Search vs Relational Database:
- Trade-off: Elasticsearch was used for search functionality over relational databases.
- Reason: Elasticsearch provides fast, scalable full-text search capabilities that relational databases would struggle to handle efficiently for large datasets.

These trade-offs were made to balance performance, scalability, and complexity, ensuring the system can efficiently handle the expected load while maintaining flexibility and reliability.
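The caching trade-off above, faster reads at the cost of careful invalidation, can be illustrated with a small cache-aside sketch. It is an assumption-laden example: the `CacheAside` class, its `ttl_seconds` parameter, and the in-memory dictionary standing in for Redis are hypothetical, and `load_from_db` represents whatever database query the real service would run.

```python
import time


class CacheAside:
    """Hypothetical cache-aside wrapper; a dict stands in for Redis."""

    def __init__(self, load_from_db, ttl_seconds=30.0, clock=time.monotonic):
        self._load = load_from_db   # fallback loader, e.g. a database query
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}            # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry[1]         # cache hit, entry still fresh
        value = self._load(key)     # miss or expired: read the database
        self._store[key] = (self._clock() + self._ttl, value)
        return value

    def invalidate(self, key):
        # Call after every write to the database so readers do not keep
        # seeing the old value until the TTL expires.
        self._store.pop(key, None)
```

The TTL bounds how long stale data can be served if an invalidation is missed, which is why the two mechanisms are usually combined.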

Failure scenarios/bottlenecks

Database Overload:

Scenario: Heavy read or write traffic can overload PostgreSQL or MongoDB, causing slow response times.
Mitigation: Use replicas, sharding, and caching (e.g., Redis) to offload read traffic and distribute the load.

Network Latency:

Scenario: Poor network conditions can lead to high latency in media streaming (e.g., WebRTC).
Mitigation: Implement adaptive bitrate streaming to lower quality during poor network conditions, and use a CDN (CloudFront) for cacheable media such as recordings, since live WebRTC streams cannot be cached.
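As a rough illustration of adaptive bitrate selection, the sketch below picks the highest quality level that fits within an estimated bandwidth budget. The `LADDER` values, the `pick_layer` function, and the `headroom` factor are hypothetical; a real client would rely on the bandwidth estimates produced by its WebRTC stack.

```python
# Hypothetical quality ladder: (required kbps, label), lowest first.
LADDER = [(150, "180p"), (500, "360p"), (1200, "720p"), (2500, "1080p")]


def pick_layer(estimated_kbps: float, headroom: float = 0.8) -> str:
    """Return the best quality label that fits the bandwidth estimate."""
    # Keep some headroom so small fluctuations do not cause stalls.
    budget = estimated_kbps * headroom
    chosen = LADDER[0]  # always fall back to the lowest rung
    for rung in LADDER:
        if rung[0] <= budget:
            chosen = rung
    return chosen[1]
```

For example, an estimate of 2000 kbps with the default headroom leaves a 1600 kbps budget, which selects the 720p rung in this ladder.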

Real-Time Data Processing Delays:

Scenario: High volumes of real-time analytics or chat data can overwhelm Kafka or the processing service.
Mitigation: Use batch processing, scale Kafka consumers, and implement backpressure handling.
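Backpressure handling can be sketched with a bounded buffer that refuses new events when full, which signals the producer to slow down or shed load. This is a simplified, single-process illustration; `BoundedBuffer`, `offer`, and `drain` are hypothetical names and not part of Kafka's API.

```python
import queue


class BoundedBuffer:
    """Hypothetical bounded buffer between event producers and a consumer."""

    def __init__(self, capacity: int):
        self._q = queue.Queue(maxsize=capacity)
        self.rejected = 0  # how many events were refused because of overload

    def offer(self, event) -> bool:
        """Try to enqueue; False tells the producer to back off."""
        try:
            self._q.put_nowait(event)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def drain(self, max_batch: int) -> list:
        """Take up to max_batch events for one batch-processing step."""
        batch = []
        while len(batch) < max_batch:
            try:
                batch.append(self._q.get_nowait())
            except queue.Empty:
                break
        return batch
```

Draining in batches ties this to the batch-processing mitigation: the consumer amortizes per-event overhead while the bound keeps memory use predictable.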

Notification Delivery Failures:

Scenario: Push notifications or email/SMS notifications may fail due to network issues or service downtime.
Mitigation: Implement retry mechanisms, store failed notifications, and deliver them when the user is back online.
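A retry mechanism for failed deliveries is often built on exponential backoff with jitter. The sketch below assumes a hypothetical `send` callable that raises `ConnectionError` on transient failure; the function name, parameters, and default delays are illustrative only.

```python
import random
import time


def send_with_retry(send, payload, max_attempts=4, base_delay=0.5,
                    max_delay=8.0, sleep=time.sleep):
    """Call send(payload), retrying transient failures with backoff."""
    for attempt in range(max_attempts):
        try:
            return send(payload)
        except ConnectionError:
            if attempt == max_attempts - 1:
                # Out of attempts: let the caller store the notification
                # for later delivery.
                raise
            # Exponential backoff capped at max_delay, with full jitter
            # so many clients do not retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

When the final attempt fails, the exception propagates so the caller can persist the notification and deliver it once the user or the downstream provider is reachable again.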

Large Media Files:

Scenario: Uploading or downloading large media files can overwhelm S3 or slow down the user experience.
Mitigation: Use multipart uploads and compression to optimize storage and transfer.
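To make the multipart idea concrete, the sketch below plans how a large file could be split into parts before upload. The `plan_parts` helper is hypothetical; the two constants reflect S3's documented multipart limits (parts of at least 5 MiB except the last, and at most 10,000 parts per upload). Actually sending the parts would be done through the storage SDK, which is not shown here.

```python
MIN_PART_SIZE = 5 * 1024 * 1024   # S3 minimum for every part except the last
MAX_PARTS = 10_000                # S3 maximum number of parts per upload


def plan_parts(file_size: int, part_size: int = 8 * 1024 * 1024) -> list:
    """Return (part_number, offset, length) tuples covering the file."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    part_size = max(part_size, MIN_PART_SIZE)
    # Grow the part size if the file would otherwise need too many parts.
    smallest_allowed = -(-file_size // MAX_PARTS)  # ceiling division
    part_size = max(part_size, smallest_allowed)

    parts = []
    offset = 0
    number = 1  # part numbers start at 1
    while offset < file_size:
        length = min(part_size, file_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts
```

Because each part is independent, parts can be uploaded in parallel and a failed part can be retried without restarting the whole transfer.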

Scaling Chat Service:

Scenario: With many users, the chat service might face high write traffic, leading to database bottlenecks.
Mitigation: Implement message deduplication and Redis caching for frequent queries, and use MongoDB's sharding for horizontal scalability.

Single Points of Failure:

Scenario: Critical services like the API Gateway or Authentication Service could fail, affecting the entire system.
Mitigation: Deploy services behind load balancers, use auto-scaling groups, and ensure high availability across multiple availability zones.

Future improvements

Improve Database Scalability:
- Future Improvement: Implement horizontal scaling and auto-sharding for databases to handle increased data loads.
- Mitigation: Use replication and distributed databases to distribute queries and reduce bottlenecks.
Enhance Media Streaming:
- Future Improvement: Integrate AI-powered network quality adjustments to improve streaming under varying network conditions.
- Mitigation: Use multi-path streaming to improve resilience against network issues and CDN optimization for faster media delivery.
Optimize Real-Time Data Processing:
- Future Improvement: Incorporate streaming data platforms like Apache Flink for real-time analytics.
- Mitigation: Implement data partitioning and scalable consumer groups in Kafka to prevent processing delays.
Improve Notification System:
- Future Improvement: Implement a priority queue for notifications to handle urgent messages more efficiently.
- Mitigation: Enhance retry logic and use geographically distributed notification servers for higher availability.
Reduce Latency in Chat Services:
- Future Improvement: Use event-driven architectures for more efficient message routing and processing.
- Mitigation: Use in-memory data grids (e.g., Hazelcast) to reduce reliance on the database for frequent queries.

These improvements would enhance system scalability, reduce latency, and ensure better fault tolerance.
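The priority queue proposed for notifications could look roughly like the following sketch, built on Python's `heapq`. The class name and the numeric priority convention (lower number means more urgent) are assumptions for illustration.

```python
import heapq
import itertools


class NotificationQueue:
    """Hypothetical priority queue: lower priority number is sent first."""

    def __init__(self):
        self._heap = []
        # Monotonic counter keeps FIFO order among equal priorities and
        # avoids comparing message payloads directly.
        self._counter = itertools.count()

    def push(self, priority: int, message) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), message))

    def pop(self):
        """Remove and return the most urgent message."""
        _priority, _seq, message = heapq.heappop(self._heap)
        return message

    def __len__(self) -> int:
        return len(self._heap)
```

With this ordering, a meeting invitation pushed at priority 0 is delivered before a daily digest pushed earlier at priority 2.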