System requirements
Functional:
- User Authentication and Registration: Users should be able to sign up on the platform and log in to their account using their own credentials.
- Search Function: Users should be able to search for events, theaters, and shows by ZIP code, city, or event name.
- Movie and Event Display: Based on their ZIP code or city, users should be presented with a set of events to choose from.
- Review Event Information: Users should be able to review information about the shows they selected, including host, showtime, etc.
- Select Seats: Users should be able to see which seats are available for a given show and select their preferred seats.
- Process Payments: The application should integrate a payment gateway to handle payments securely.
- Cancellation and Refunds: The system should allow users to cancel previous bookings and should process the corresponding refunds.
- User Notification: The system should incorporate a notification system to send users reminders about events and any changes related to their bookings.
- Customer Support: Provide a help center to help customers troubleshoot their issues.
Non-Functional:
- Availability: Users should be able to access their profile and the general platform from anywhere in the world.
- Scalability: The system should be able to grow and shrink to meet user demand and handle a large number of concurrent requests.
- Performance: User feed and profile should be displayed in under 800 milliseconds.
- Reliability: The application should be fault tolerant and meet the 99.999% ("five nines") availability target, meaning it should be down for less than about 5.3 minutes per year.
- Security: Users' information should be stored safely at rest and in transit. The system should follow access control and authorization best practices.
- Consistency: The system should ensure that double bookings cannot occur.
- Maintainability: The system should be designed with minimal complexity, to make future changes easy to implement.
- Fault Tolerance: The system should continue working if some of its components fail.
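The availability target in the reliability requirement can be sanity-checked with a quick calculation. The sketch below converts an availability percentage into a yearly downtime budget; the function name is illustrative and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):.2f} minutes/year")
```

Five nines works out to roughly 5.26 minutes of downtime per year, while a 30-minute budget corresponds to a little over 99.994%.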
Capacity estimation
Each read request = 500 bytes
Each write request = 1 KB
With 5,000 requests per second
157.68 billion requests per year (5,000 × 31,536,000 seconds)
Roughly 79 TB to 158 TB of request data processed per year, depending on the read/write mix
Data per ticket = 10 KB
Data per account = 2 MB
On average, 1 customer processes 3 tickets/day
We estimate our initial deployment at 5,000 users
15,000 tickets processed per day (150,000 KB/day)
On a yearly basis, up to 54,750,000 KB/year (about 54.75 GB/year)
5,000 user profiles total 10,000 MB (about 10 GB), stored once rather than per day
Total storage required: around 65 GB for the first year
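The arithmetic behind this estimate can be reproduced with a short script. The sketch below uses only the assumptions stated above (request sizes, request rate, ticket and account sizes, user count); decimal units (1 KB = 1,000 bytes) and a 365-day year are assumptions made for the calculation.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

# Inputs taken from the estimate above (decimal units assumed).
requests_per_second = 5_000
read_request_bytes = 500
write_request_bytes = 1_000   # 1 KB
ticket_bytes = 10_000         # 10 KB
account_bytes = 2_000_000     # 2 MB
users = 5_000
tickets_per_user_per_day = 3

requests_per_year = requests_per_second * SECONDS_PER_YEAR
tickets_per_day = users * tickets_per_user_per_day
ticket_bytes_per_year = tickets_per_day * ticket_bytes * 365
account_bytes_total = users * account_bytes  # stored once, not per day
first_year_storage_bytes = ticket_bytes_per_year + account_bytes_total

print(f"requests/year:        {requests_per_year / 1e9:.2f} billion")
print(f"traffic if all reads:  {requests_per_year * read_request_bytes / 1e12:.2f} TB/year")
print(f"traffic if all writes: {requests_per_year * write_request_bytes / 1e12:.2f} TB/year")
print(f"ticket storage:       {ticket_bytes_per_year / 1e9:.2f} GB/year")
print(f"account storage:      {account_bytes_total / 1e9:.2f} GB")
print(f"first-year storage:   {first_year_storage_bytes / 1e9:.2f} GB")
```

Keeping the inputs as named variables makes it easy to rerun the estimate when the user count or ticket size changes.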
API design
POST /SignUp
POST /Login
POST /SignOut
GET /Account?account_id=
DELETE /Account?account_id=
GET /Movies
GET /Movies/ID/Details
GET /Movies/ID/Seats
GET /Event
GET /Event/ID/Details
GET /Event/ID/Seats
GET /Theater?Zipcode=
GET /Theater/Movies
POST /booking?Event_ID=
POST /cancel?Event_ID=
POST /refund?booking_ID=
POST /new_movie
POST /new_event
POST /Movies/List
POST /Support?ID=
GET /notifications/user_id
POST /movie/user_id/notify
GET /analytics
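Several of these endpoints take their identifier as a query parameter, which in a URL is introduced by `?` and separated by `&`. As a minimal sketch, a client could build such URLs with the standard library; the host `https://api.example.com` and the helper name `build_url` are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.example.com"  # hypothetical host, not part of the design


def build_url(path: str, **params: str) -> str:
    """Join a path and optional query parameters into a full URL."""
    query = urlencode(params)
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


print(build_url("/Theater", Zipcode="94105"))
print(build_url("/Movies"))
```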
Database design
High-level design
Request flows
Let’s say a user wants to book a ticket for an event via a web or mobile client.
1. Request Initialization
- The Client (C1) makes a request (e.g., viewing an event or booking a seat).
- This request is first routed through the CDN (C), which may serve cached static content or forward dynamic requests.
2. API Gateway and Load Balancing
- The request is forwarded to the API Gateway (APG) where:
  - Routing (R1) directs the request to the right service.
  - Payload Compression (P1) optimizes the request data.
  - Authentication (A1) checks user credentials.
- The request passes through the Load Balancer (L1) to distribute traffic to available service instances.
3. API Service Layer
- The request reaches the central API service node (A), which delegates it to the necessary services:
  - To get event details → Event Service (E1)
  - To show feed or promotions → Feed Service (F1)
  - To search events → Search Service (S1)
4. Backend Coordination
- Event Service (E1) may pull data from:
  - Event Cache (E2) for fast retrieval.
  - Event Database (D1) for persistent storage.
- Seat Service (S2) helps check seat availability linked to events.
- Rating Service (R2) provides user ratings for events.
5. Ticket Booking
- Once the user selects a seat:
  - Ticket Service (T) creates a ticket.
  - Booking Service (B1) initiates the booking process.
  - Payment Service (P) processes payment:
    - Communicates with Payment Gateway (PG).
    - Logs to Payment Backlog (PB) for reconciliation.
  - Notification Service (NS) sends confirmation to the user.
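The booking step can be sketched as a small orchestration: hold the seat, attempt the payment, then either confirm the booking or release the seat. This is one possible ordering under stated assumptions, not the design's definitive flow; `SeatInventory`, `book_seat`, and the `charge` callback are hypothetical in-memory stand-ins for the Seat, Booking, and Payment services.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SeatInventory:
    """In-memory stand-in for the Seat Service."""
    available: set[str]
    held: dict[str, str] = field(default_factory=dict)  # seat -> user

    def hold(self, seat: str, user: str) -> bool:
        if seat not in self.available:
            return False
        self.available.remove(seat)
        self.held[seat] = user
        return True

    def release(self, seat: str) -> None:
        # Put a held seat back on sale (used when payment fails).
        if self.held.pop(seat, None) is not None:
            self.available.add(seat)

    def confirm(self, seat: str) -> None:
        # The seat is sold: drop the hold and keep it out of `available`.
        self.held.pop(seat, None)


def book_seat(inventory: SeatInventory, seat: str, user: str,
              charge: Callable[[str], bool]) -> str:
    """Hold the seat, attempt payment, then confirm or roll back."""
    if not inventory.hold(seat, user):
        return "seat_unavailable"
    if not charge(user):
        inventory.release(seat)
        return "payment_failed"
    inventory.confirm(seat)
    return "confirmed"


inventory = SeatInventory(available={"A1", "A2"})
first = book_seat(inventory, "A1", "alice", charge=lambda user: True)
second = book_seat(inventory, "A1", "bob", charge=lambda user: True)
third = book_seat(inventory, "A2", "bob", charge=lambda user: False)
print(first, second, third)
```

Releasing the hold on payment failure is what keeps a failed checkout from permanently removing a seat from sale.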
6. Finalization & Feedback Loop
- If it's a new event submission:
  - API Write (Ap2) → Event Insertion Service (EIS) → Virtual Queue (MQ1) → Event Service (E1) updates.
- User Service (U) and Subscription Service (S3) track engagement.
- Feedback (ratings, subscriptions) enhances personalization via Feed Service (F1) and Search Service (S1).
Detailed component design
Event Service (E1)
Responsibilities
- Stores and manages metadata for events: title, description, date/time, venue, etc.
- Handles real-time updates via the Event Insertion Service and Virtual Queue.
- Exposes endpoints for querying events (via API Read) and creating new ones (via API Write).
Scalability
- Reads scale using Event Cache (E2) — a distributed in-memory store like Redis or Memcached.
- Writes go through Virtual Queue (MQ1) to decouple and throttle event ingestion, allowing eventual consistency and resilience to spikes.
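The read path through the Event Cache follows the cache-aside pattern. The sketch below uses a tiny in-memory TTL cache as a stand-in for Redis or Memcached; `TTLCache`, `get_event`, and `load_from_db` are hypothetical names, and no real cache client is involved.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Tiny in-memory stand-in for Event Cache (E2); not a Redis client."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: dict) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)

    def delete(self, key: str) -> None:
        self._entries.pop(key, None)


def get_event(event_id: str, cache: TTLCache,
              load_from_db: Callable[[str], dict]) -> dict:
    """Cache-aside read: try the cache, fall back to the database on a miss."""
    cached = cache.get(event_id)
    if cached is not None:
        return cached
    event = load_from_db(event_id)
    cache.set(event_id, event)
    return event


db_calls = []


def load_from_db(event_id: str) -> dict:
    db_calls.append(event_id)  # record how often the database is hit
    return {"id": event_id, "title": "Demo event"}


cache = TTLCache(ttl_seconds=60)
get_event("e1", cache, load_from_db)
get_event("e1", cache, load_from_db)
print(len(db_calls))  # the second read is served from the cache
```

On the write path, calling `cache.delete(event_id)` after a successful update is one simple way to limit stale reads.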
Algorithms & Structures
- Indexing with B-Trees or LSM Trees in databases for efficient range queries.
- Time-based partitioning to segment events (e.g., by month/year) — improves query performance.
- Concurrency control via optimistic locking to avoid conflicts when updating event records.
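Optimistic locking is commonly implemented with a version number that must match at write time. The following is a minimal sketch with an in-memory store standing in for the Event Database; `EventStore`, `EventRecord`, and `VersionConflict` are hypothetical names.

```python
from dataclasses import dataclass


class VersionConflict(Exception):
    """Raised when a record changed since the caller read it."""


@dataclass
class EventRecord:
    title: str
    version: int = 0


class EventStore:
    """In-memory stand-in for the Event Database (D1)."""

    def __init__(self) -> None:
        self._rows: dict[str, EventRecord] = {}

    def insert(self, event_id: str, title: str) -> None:
        self._rows[event_id] = EventRecord(title)

    def read(self, event_id: str) -> EventRecord:
        row = self._rows[event_id]
        return EventRecord(row.title, row.version)  # hand out a copy

    def update(self, event_id: str, new_title: str, expected_version: int) -> None:
        row = self._rows[event_id]
        if row.version != expected_version:
            raise VersionConflict(f"{event_id} is now at version {row.version}")
        row.title = new_title
        row.version += 1


store = EventStore()
store.insert("e1", "Jazz Night")

editor_a = store.read("e1")
editor_b = store.read("e1")

store.update("e1", "Jazz Night (sold out)", expected_version=editor_a.version)
try:
    store.update("e1", "Jazz Evening", expected_version=editor_b.version)
    conflict_detected = False
except VersionConflict:
    conflict_detected = True  # editor B must re-read and retry

print(conflict_detected)
```

In a relational database the same check is usually expressed as an `UPDATE ... WHERE version = ?` whose affected-row count tells the caller whether it won.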
Payment Service (P)
Responsibilities
- Interfaces with external Payment Gateway (PG).
- Logs transactions in the Payment Backlog (PB).
- Integrates with Booking Service to finalize purchases.
Scalability
- Idempotent APIs: Designed so retries (e.g., due to timeouts) don’t double-charge users.
- Message queues (e.g., RabbitMQ or Kafka) are used to persist and retry failed transactions.
- Sharded payment logs and eventually consistent updates using a transactional outbox pattern.
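An idempotent charge API can be sketched by remembering the result for each idempotency key and replaying it on retries. The class and the `charge_gateway` callback below are hypothetical; a production version would keep the keys in a durable store and record them atomically with the charge.

```python
from typing import Callable


class PaymentProcessor:
    """Replays the stored result when the same idempotency key is seen again."""

    def __init__(self, charge_gateway: Callable[[str, int], dict]) -> None:
        self._charge_gateway = charge_gateway
        self._results: dict[str, dict] = {}  # idempotency key -> prior result

    def charge(self, idempotency_key: str, user_id: str, amount_cents: int) -> dict:
        if idempotency_key in self._results:
            return self._results[idempotency_key]  # retry: do not charge again
        result = self._charge_gateway(user_id, amount_cents)
        self._results[idempotency_key] = result
        return result


gateway_calls = []


def fake_gateway(user_id: str, amount_cents: int) -> dict:
    gateway_calls.append((user_id, amount_cents))
    return {"status": "charged", "amount_cents": amount_cents}


processor = PaymentProcessor(fake_gateway)
first = processor.charge("booking-42", "alice", 2500)
retry = processor.charge("booking-42", "alice", 2500)  # e.g. after a client timeout
print(len(gateway_calls), first == retry)
```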
Algorithms & Structures
- Merkle trees or hash-chained logs to make the payment audit trail tamper-evident and verifiable.
- Rate limiting & circuit breakers to isolate faulty 3rd-party gateway behavior.
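A circuit breaker stops calling a failing gateway for a cool-down period instead of piling up timeouts. The sketch below is a simplified single-process version; the class name, thresholds, and the `flaky_gateway` function are assumptions for illustration only.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when calls are blocked because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("gateway calls are temporarily blocked")
            # Cool-down elapsed: allow one trial call (half-open state).
            self._opened_at = None
            self._failures = self._threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # open the circuit
            raise
        self._failures = 0  # any success closes the circuit fully
        return result


gateway_attempts = []


def flaky_gateway() -> None:
    gateway_attempts.append(1)
    raise TimeoutError("gateway timeout")


breaker = CircuitBreaker(failure_threshold=3, reset_after_seconds=30)
for _ in range(3):
    try:
        breaker.call(flaky_gateway)
    except TimeoutError:
        pass

try:
    breaker.call(flaky_gateway)
    blocked = False
except CircuitOpenError:
    blocked = True  # the fourth call never reaches the gateway

print(blocked, len(gateway_attempts))
```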
Search Service (S1)
Responsibilities
- Indexes and enables full-text + metadata-based search for events, tickets, users.
- Integrates results from Event Cache and Feed Service.
Scalability
- Backed by a distributed search engine like Elasticsearch or OpenSearch.
- Can scale horizontally — adding new nodes to expand capacity and improve availability.
Algorithms & Structures
- Inverted index for full-text search.
- TF-IDF / BM25 scoring to rank relevance.
- Caching for common or popular queries, plus Bloom filters to skip lookups for terms that are definitely not in the index.
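The inverted index and relevance scoring can be illustrated with a toy example. The sketch below uses a plain TF-IDF style score rather than BM25, and whitespace tokenization; it is not how Elasticsearch or OpenSearch implement ranking, only an illustration of the idea.

```python
import math
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    return text.lower().split()


class InvertedIndex:
    """Maps each term to the documents that contain it, with term frequencies."""

    def __init__(self) -> None:
        self._postings: dict[str, dict[str, int]] = defaultdict(dict)
        self._doc_count = 0

    def add(self, doc_id: str, text: str) -> None:
        self._doc_count += 1
        for term, freq in Counter(tokenize(text)).items():
            self._postings[term][doc_id] = freq

    def search(self, query: str) -> list[str]:
        scores: dict[str, float] = defaultdict(float)
        for term in tokenize(query):
            postings = self._postings.get(term, {})
            if not postings:
                continue
            # Rarer terms get a higher weight (inverse document frequency).
            idf = math.log(1 + self._doc_count / len(postings))
            for doc_id, freq in postings.items():
                scores[doc_id] += freq * idf
        return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


index = InvertedIndex()
index.add("e1", "jazz night downtown")
index.add("e2", "rock concert downtown")
index.add("e3", "jazz jazz festival")

print(index.search("jazz"))   # ['e3', 'e1']
print(index.search("opera"))  # []
```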
Trade offs/Tech choices
1. API Gateway and Load Balancer
Choice:
We placed an API Gateway in front of a Load Balancer to control, route, and throttle traffic before it reaches microservices.
Pros:
- Centralized authentication, compression, and routing.
- Improves observability and monitoring (e.g., metrics, rate-limiting).
- Allows smoother rollout of new services or A/B testing.
Trade-off:
- Adds latency overhead from request inspection and transformation.
- Requires careful versioning of APIs and tight coordination between teams.
Why:
The added control and security at the gateway layer are worth the tradeoff in latency, especially when handling high-throughput user-facing traffic.
2. Caching with Redis (Event Cache, Query Cache)
Choice:
Used Redis/Memcached as a distributed cache in Event Service and Search Service to serve frequently accessed data.
Pros:
- Reduces read latency from DB (~10–100x faster).
- Handles read-heavy workloads more gracefully.
Trade-off:
- Possible cache inconsistency (stale data) in case of write-heavy operations.
- Adds complexity in cache invalidation logic.
Why:
Since most user traffic involves reading events or search results, caching improves user experience without scaling database reads aggressively.
3. Asynchronous Processing with Queues (Virtual Queue, Retry Queue)
Choice:
We introduced queues (e.g., Kafka/RabbitMQ) for event ingestion and payment retries.
Pros:
- Decouples slow processes from the user experience.
- Handles load spikes better with backpressure.
- Makes the system more resilient to temporary failures.
Trade-off:
- More complexity: messages must be idempotent and durable.
- Delayed consistency — not ideal for time-critical flows.
Why:
Queues let us scale components independently, especially write-intensive ones like event insertion and payments.
4. Use of Distributed Search Engine (Elasticsearch)
Choice:
Used Elasticsearch (or OpenSearch) for Search Service to enable fast, ranked, and fuzzy search queries.
Pros:
- Inverted indexing offers blazing fast search performance.
- Supports relevance scoring, autocomplete, and filtering.
Trade-off:
- Data duplication — events need to be stored in both main DB and search index.
- Requires managing index sync and background workers.
Why:
The benefits of personalized and ranked search results outweigh the cost of managing another data pipeline.
5. Separate Read & Write APIs (Ap1 / Ap2)
Choice:
Split API responsibilities into Read (Ap1) and Write (Ap2) APIs.
Pros:
- Independent scaling — read APIs can scale aggressively without impacting write latency.
- Improves fault isolation — one failure won’t take down both.
Trade-off:
- More services = more deployment/configuration overhead.
- Requires client awareness of read vs write endpoints.
Failure scenarios/bottlenecks
General Bottlenecks:
| Component | Bottleneck Example | Resolution Strategy |
|---|---|---|
| Event DB | Write contention on same event_id | Partitioning / DB queueing |
| Notification | Rate limits on email/SMS APIs | Use queues + batch notifications |
| Payment Service | Synchronous dependency on Payment Gateway | Move to async with status callbacks |
| Feed/Search | Sudden traffic from trending events | Auto-scale replicas + cache hot results |
Failure Scenario Example (Booking Surge):
- A viral event launches.
- 10K users hit API Gateway → Load Balancer → Booking Service.
- Event DB experiences contention on available_seats.
- Payment Gateway has timeout errors, users retry.
- Virtual Queue buffers overflow with booking tasks.
- Some seat bookings fail silently → User dissatisfaction.
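One way to keep contention on available_seats from turning into overselling or silent failures is a single conditional update whose affected-row count tells the caller whether the seat was taken. The sketch below uses an in-memory SQLite database as a stand-in for the Event DB; the table layout and the `reserve_seat` helper are assumptions.

```python
import sqlite3


def reserve_seat(conn: sqlite3.Connection, event_id: str) -> bool:
    """Atomically take one seat; return False when the event is sold out."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "UPDATE events SET available_seats = available_seats - 1 "
            "WHERE event_id = ? AND available_seats > 0",
            (event_id,),
        )
    return cursor.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_id TEXT PRIMARY KEY, available_seats INTEGER NOT NULL)"
)
conn.execute("INSERT INTO events VALUES ('viral-show', 2)")
conn.commit()

results = [reserve_seat(conn, "viral-show") for _ in range(3)]
print(results)  # [True, True, False]
```

Because the failure is returned explicitly, the Booking Service can tell the user the event is sold out instead of dropping the request.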
Future improvements
1. Eventual Consistency via Event-Driven Architecture
- What: Replace some synchronous service-to-service calls (e.g., booking → payment → notification) with event-driven communication (e.g., Kafka, NATS, or Amazon SNS/SQS).
- Why: Increases fault tolerance and decouples services.
- Failure Mitigation: If a downstream service fails, messages stay in the queue and are retried independently.
2. Distributed Locking for Booking Logic
- What: Introduce Redis-based distributed locks (e.g., Redlock) to manage concurrent seat booking.
- Why: Avoids race conditions/overselling tickets in high concurrency environments.
- Failure Mitigation: Ensures atomic operations for inventory control under peak traffic.
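The core lock semantics (set if absent, expire after a TTL, release only by the owner) can be sketched without a lock server. The `LockStore` class below is a single-process, in-memory stand-in; it is not Redlock and does not talk to Redis, and all names are hypothetical.

```python
import time
import uuid
from typing import Callable, Optional


class LockStore:
    """In-memory stand-in for a lock server: set-if-absent with expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._locks: dict[str, tuple[str, float]] = {}  # key -> (token, expires_at)

    def acquire(self, key: str, ttl_seconds: float) -> Optional[str]:
        now = self._clock()
        current = self._locks.get(key)
        if current is not None and current[1] > now:
            return None  # someone else holds a live lock
        token = uuid.uuid4().hex  # identifies this owner
        self._locks[key] = (token, now + ttl_seconds)
        return token

    def release(self, key: str, token: str) -> bool:
        current = self._locks.get(key)
        if current is None or current[0] != token:
            return False  # never release a lock owned by someone else
        del self._locks[key]
        return True


locks = LockStore()
seat_key = "seat:event-1:A1"

alice_token = locks.acquire(seat_key, ttl_seconds=5)
bob_token = locks.acquire(seat_key, ttl_seconds=5)   # None: Alice holds the lock
wrong_release = locks.release(seat_key, "not-the-owner")
right_release = locks.release(seat_key, alice_token)
bob_retry = locks.acquire(seat_key, ttl_seconds=5)

print(bob_token is None, wrong_release, right_release, bob_retry is not None)
```

The TTL ensures that a crashed booking worker cannot hold a seat forever, and the owner token prevents a slow worker from releasing a lock that has since passed to another user.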
3. Chaos Engineering
- What: Introduce tools like Gremlin or Chaos Monkey to simulate service crashes, latency, etc.
- Why: Proactively identify system weaknesses before they cause outages for real users.
- Failure Mitigation: Catch cascading failures early and build resilience.
4. Auto-Scaling + Load Prediction with ML
- What: Use past event data to predict traffic surges and auto-scale resources (compute, cache, DB replicas) beforehand.
- Why: Helps prepare for peak events, reduces latency and cost.
- Failure Mitigation: Prevents DB/cache bottlenecks under load.
5. Geo-Redundancy & Multi-Region Deployments
- What: Deploy services in multiple regions, with geo-aware DNS routing.
- Why: Protects against region-wide cloud outages.
- Failure Mitigation: Keeps core features up even if a region fails.
6. Decouple Notification System with Fallbacks
- What: Retry on multiple providers (e.g., SendGrid → Mailgun → SES) and store failed messages for delayed processing.
- Why: Improves delivery reliability.
- Failure Mitigation: No lost or missed critical alerts (like ticket confirmation).
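The fallback chain can be sketched as a loop over providers with a dead-letter list for messages that nobody could deliver. The provider functions below are fakes that only simulate success or failure; real provider SDK calls are deliberately not shown, and all names are hypothetical.

```python
from typing import Callable, Optional

Provider = tuple[str, Callable[[str], None]]


def send_with_fallback(message: str, providers: list[Provider],
                       dead_letter: list[str]) -> Optional[str]:
    """Try each provider in order; park the message if all of them fail."""
    for name, send in providers:
        try:
            send(message)
            return name  # delivered: report which provider worked
        except Exception:
            continue  # try the next provider
    dead_letter.append(message)  # keep for delayed reprocessing
    return None


def failing_provider(message: str) -> None:
    raise ConnectionError("provider unavailable")


delivered = []


def working_provider(message: str) -> None:
    delivered.append(message)


dead_letter: list[str] = []
used = send_with_fallback(
    "Your ticket is confirmed",
    [("primary", failing_provider), ("secondary", working_provider)],
    dead_letter,
)
lost = send_with_fallback("Event rescheduled", [("primary", failing_provider)], dead_letter)

print(used, lost, dead_letter)
```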
7. Global Rate Limiting and Throttling
- What: Implement global token bucket or leaky bucket rate limiting across users, IPs, and endpoints.
- Why: Prevent system overload and abuse.
- Failure Mitigation: Gracefully sheds load during traffic spikes.
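A token bucket allows short bursts up to its capacity and then limits callers to the refill rate. The sketch below is a single-process version with an injectable clock; a global limiter would need the bucket state in a shared store, which is not shown here.

```python
import time
from typing import Callable


class TokenBucket:
    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._capacity = capacity
        self._refill = refill_per_second
        self._clock = clock
        self._tokens = float(capacity)  # start full so bursts are allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend tokens if the request fits in the budget."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self._tokens = min(self._capacity, self._tokens + elapsed * self._refill)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False  # caller should reject or delay the request


fake_now = [0.0]  # controllable clock for the demonstration
bucket = TokenBucket(capacity=2, refill_per_second=1, clock=lambda: fake_now[0])

burst = [bucket.allow() for _ in range(3)]
fake_now[0] = 1.0  # one second later, one token has been refilled
after_wait = bucket.allow()

print(burst, after_wait)  # [True, True, False] True
```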
8. Dedicated Write Service with Idempotency
- What: Use a write-ahead service to queue and validate write requests, ensuring deduplication via idempotency keys.
- Why: Prevents data duplication, especially under retries.
- Failure Mitigation: Avoids issues like double billing or duplicate bookings.
9. Monitoring, Alerting & Dashboards
- What: Use tools like Prometheus + Grafana, ELK stack, and Sentry for deep observability.
- Why: Detect and resolve issues faster.
- Failure Mitigation: Prevents silent failures; alerts you before users complain.