Design Ticketmaster - System Design

System requirements

Functional:

User Authentication and Registration: Users should be able to sign up for the platform and log in to their account using their credentials.
Search Function: Users should be able to search for events, theaters, and shows by ZIP code, city, or event name.
Movie and Event Display: Based on their ZIP code or city, users should be presented with a set of events to choose from.
Review Event Information: Users should be able to review information about the shows they selected, including host, showtime, etc.
Select Seats: Users should be able to see which seats are available for a given show and select their preferred seats.
Process Payments: The application should integrate a payment gateway to handle payments securely.
Cancellation and Refunds: The system should allow users to cancel previous bookings and should process the refunds.
User Notification: The system should incorporate a notification system to send users reminders about events and any changes related to their bookings.
Customer Support: Provide a help center to help customers troubleshoot their issues.

Non-Functional:

Availability: Users should be able to access their profile and the general platform from anywhere in the world.
Scalability: The system should be able to grow and shrink to meet user demand and handle a large number of concurrent requests.
Performance: User feed and profile should be displayed in under 800 milliseconds.
Reliability: The application should be fault tolerant and meet the 99.999% ("five nines") availability target, meaning it should be down for less than about 5.3 minutes per year.
Security: User information should be safely stored at rest and in transit. Respect access control and authorization best practices.
Consistency: The system should ensure double bookings are not possible.
Maintainability: The system should be designed with minimal complexity, to make future changes easy to implement.
Fault Tolerance: The system should be designed to continue working in the event that some component of the system fails.
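
As a quick check on the availability target, the yearly downtime budget can be computed directly from the percentage. This is a minimal sketch; the function name is illustrative and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Compare three nines, four nines, and five nines.
for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):.2f} min/year")
```

Five nines leaves a budget of only a few minutes per year, so planned maintenance windows have to fit inside it as well.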

Capacity estimation

Each Read Request = 500 bytes

Each Write Request = 1 KB

With 5,000 requests per second

157.68 billion requests per year (5,000 × 31,536,000 seconds)

Roughly 79 TB to 158 TB of request data per year, at 500 bytes to 1 KB per request

Data per Ticket = 10KB

Data Per Account = 2MB

On Average 1 customer processes 3 tickets/day

We estimate our initial deployment with 5000 users

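15,000 tickets processed per day (150,000 KB/day, i.e. 150 MB/day)

On a yearly basis, this represents about 54,750,000 KB/year (roughly 55 GB/year)

5,000 user profiles total 10,000 MB (about 10 GB)

Profile storage is a one-time cost per account, not a daily one

Total data required for the first year: around 65 GB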
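
The arithmetic can be reproduced from the stated inputs. This is a back-of-the-envelope sketch: it assumes decimal units (1 KB = 1,000 bytes), a 365-day year, and that account data is stored once per user; the variable names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

# Inputs taken from the estimates above.
requests_per_second = 5_000
read_request_bytes = 500
write_request_bytes = 1_000
users = 5_000
tickets_per_user_per_day = 3
ticket_kb = 10
account_mb = 2

# Request volume and the traffic range (all reads vs. all writes).
requests_per_year = requests_per_second * SECONDS_PER_YEAR
traffic_tb_low = requests_per_year * read_request_bytes / 1e12
traffic_tb_high = requests_per_year * write_request_bytes / 1e12

# Storage: tickets accumulate daily, accounts are stored once.
tickets_per_day = users * tickets_per_user_per_day
ticket_gb_per_year = tickets_per_day * ticket_kb * 365 / 1e6
account_gb = users * account_mb / 1e3
storage_gb_year_one = ticket_gb_per_year + account_gb

print(f"requests/year: {requests_per_year:,}")
print(f"traffic/year:  {traffic_tb_low:.2f} to {traffic_tb_high:.2f} TB")
print(f"storage, year one: {storage_gb_year_one:.2f} GB")
```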

API design

POST /SignUp

POST /Login

POST /SignOut

GET /Account?account_id=

DELETE /Account?account_id=

GET /Movies

GET /Movies/ID/Details

GET /Movies/ID/Seats

GET /Event

GET /Event/ID/Details

GET /Event/ID/Seats

GET /Theater?Zipcode=

GET /Theater/Movies

POST /booking?Event_ID=

POST /cancel?Event_ID=

POST /refund?booking_ID=

POST /new_movie

POST /new_event

POST /Movies/List

POST /Support?ID=

GET /notifications/user_id

POST /movie/user_id/notify

GET /analytics

Database design

High-level design

Request flows

Let’s say a user wants to book a ticket for an event via a web or mobile client.

1. Request Initialization

The Client (C1) makes a request (e.g., viewing an event or booking a seat).
This request is first routed through the CDN (C), which may serve cached static content or forward dynamic requests.

2. API Gateway and Load Balancing

The request is forwarded to the API Gateway (APG) where:
- Routing (R1) directs the request to the right service.
- Payload Compression (P1) optimizes the request data.
- Authentication (A1) checks user credentials.
The request passes through the Load Balancer (L1) to distribute traffic to available service instances.

3. API Service Layer

The request reaches the central API service node (A), which delegates it to the necessary services:
- To get event details → Event Service (E1)
- To show feed or promotions → Feed Service (F1)
- To search events → Search Service (S1)

4. Backend Coordination

Event Service (E1) may pull data from:
- Event Cache (E2) for fast retrieval.
- Event Database (D1) for persistent storage.
Seat Service (S2) helps check seat availability linked to events.
Rating Service (R2) provides user ratings for events.

5. Ticket Booking

Once the user selects a seat:
- Ticket Service (T) creates a ticket.
- Booking Service (B1) initiates the booking process.
- Payment Service (P) processes payment:
  - Communicates with Payment Gateway (PG).
  - Logs to Payment Backlog (PB) for reconciliation.
- Notification Service (NS) sends confirmation to the user.

6. Finalization & Feedback Loop

If it's a new event submission:
- API Write (Ap2) → Event Insertion Service (EIS) → Virtual Queue (MQ1) → Event Service (E1) updates.
User Service (U) and Subscription Service (S3) track engagement.
Feedback (ratings, subscriptions) enhances personalization via Feed Service (F1) and Search Service (S1).

Detailed component design

Event Service (E1)

Responsibilities

Stores and manages metadata for events: title, description, date/time, venue, etc.
Handles real-time updates via the Event Insertion Service and Virtual Queue.
Exposes endpoints for querying events (via API Read) and creating new ones (via API Write).

Scalability

Reads scale using Event Cache (E2) — a distributed in-memory store like Redis or Memcached.
Writes go through Virtual Queue (MQ1) to decouple and throttle event ingestion, allowing eventual consistency and resilience to spikes.

Algorithms & Structures

Indexing with B-Trees or LSM Trees in databases for efficient range queries.
Time-based partitioning to segment events (e.g., by month/year) — improves query performance.
Concurrency control via optimistic locking to avoid conflicts when updating event records.
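
Optimistic locking can be sketched with a version counter on each record: a write succeeds only if the version is still the one the writer read. The in-memory `EventStore` below is a hypothetical stand-in for the Event Database; in a SQL store the same check is usually an `UPDATE ... WHERE id = ? AND version = ?` followed by a check of the affected row count.

```python
from dataclasses import dataclass


class StaleWriteError(Exception):
    """Raised when a record changed after the caller read it."""


@dataclass
class EventRecord:
    title: str
    version: int = 0


class EventStore:
    """In-memory stand-in for the Event Database (hypothetical)."""

    def __init__(self) -> None:
        self._rows: dict[str, EventRecord] = {}

    def insert(self, event_id: str, title: str) -> None:
        self._rows[event_id] = EventRecord(title=title)

    def read(self, event_id: str) -> EventRecord:
        row = self._rows[event_id]
        # Return a copy so callers cannot mutate the stored row.
        return EventRecord(row.title, row.version)

    def update(self, event_id: str, title: str, expected_version: int) -> None:
        row = self._rows[event_id]
        if row.version != expected_version:
            raise StaleWriteError(
                f"{event_id}: expected v{expected_version}, found v{row.version}"
            )
        row.title = title
        row.version += 1


store = EventStore()
store.insert("evt-1", "Jazz Night")

# Two writers read the same version of the record.
first = store.read("evt-1")
second = store.read("evt-1")

# The first write wins and bumps the version.
store.update("evt-1", "Jazz Night (Sold Out)", first.version)

# The second write carries a stale version and is rejected.
try:
    store.update("evt-1", "Jazz Night (Rescheduled)", second.version)
    conflict_detected = False
except StaleWriteError:
    conflict_detected = True
```

The losing writer is expected to re-read the record and retry, which works well when conflicts on the same event are rare.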

Payment Service (P)

Responsibilities

Interfaces with external Payment Gateway (PG).
Logs transactions in the Payment Backlog (PB).
Integrates with Booking Service to finalize purchases.

Scalability

Idempotent APIs: Designed so retries (e.g., due to timeouts) don’t double-charge users.
Message queues (e.g., RabbitMQ or Kafka) are used to persist and retry failed transactions.
Sharded payment logs and eventually consistent updates using a transactional outbox pattern.
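
The idempotent-API idea can be sketched as follows: the client sends the same idempotency key on every retry, and the service returns the stored result instead of charging again. `PaymentProcessor` and `fake_gateway` are hypothetical names, and the in-memory dictionary stands in for a durable store with a unique constraint on the key.

```python
from typing import Callable


class PaymentProcessor:
    """Deduplicates charges by idempotency key (in-memory sketch)."""

    def __init__(self, charge_gateway: Callable[[str, int], str]) -> None:
        self._charge_gateway = charge_gateway
        self._receipts: dict[str, str] = {}  # idempotency key -> receipt

    def charge(self, idempotency_key: str, booking_id: str, amount_cents: int) -> str:
        # A retry with a known key returns the original receipt.
        if idempotency_key in self._receipts:
            return self._receipts[idempotency_key]
        receipt = self._charge_gateway(booking_id, amount_cents)
        self._receipts[idempotency_key] = receipt
        return receipt


gateway_calls: list[tuple[str, int]] = []


def fake_gateway(booking_id: str, amount_cents: int) -> str:
    """Stand-in for the external Payment Gateway; records each call."""
    gateway_calls.append((booking_id, amount_cents))
    return f"receipt-{len(gateway_calls)}"


processor = PaymentProcessor(fake_gateway)
first_receipt = processor.charge("key-abc", "booking-42", 7500)
retry_receipt = processor.charge("key-abc", "booking-42", 7500)  # client retry
```

This sketch is not thread-safe; under concurrent retries the key would need to be claimed atomically before the gateway is called.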

Algorithms & Structures

Merkle Trees or hash logs for fraud detection / audit trail validation.
Rate limiting & circuit breakers to isolate faulty 3rd-party gateway behavior.
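
A circuit breaker around the gateway call might look like the sketch below. It is a simplified version with assumed thresholds and an injectable clock for testing: after a number of consecutive failures it rejects calls immediately, then lets one trial call through once the reset timeout has passed.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the gateway."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("gateway calls are short-circuited")
            # Timeout elapsed: half-open, allow one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result


now = [0.0]  # fake clock so the example is deterministic
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])


def failing_gateway() -> str:
    raise TimeoutError("gateway timed out")


outcomes = []
for _ in range(3):
    try:
        breaker.call(failing_gateway)
    except CircuitOpenError:
        outcomes.append("short-circuited")
    except TimeoutError:
        outcomes.append("timeout")

now[0] = 11.0  # past the reset timeout
recovered = breaker.call(lambda: "ok")
```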

Search Service (S1)

Responsibilities

Indexes and enables full-text + metadata-based search for events, tickets, users.
Integrates results from Event Cache and Feed Service.

Scalability

Backed by a distributed search engine like Elasticsearch or OpenSearch.
Can scale horizontally — adding new nodes to expand capacity and improve availability.

Algorithms & Structures

Inverted index for full-text search.
TF-IDF / BM25 scoring to rank relevance.
Caching + Bloom Filters to optimize common or popular queries.
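
To make the inverted index and BM25 ranking concrete, here is a toy, single-process sketch. It is not how Elasticsearch or OpenSearch is implemented; the class name, whitespace tokenizer, and parameter defaults (k1 = 1.5, b = 0.75) are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


class SearchIndex:
    """Toy inverted index with BM25 scoring."""

    def __init__(self, k1: float = 1.5, b: float = 0.75) -> None:
        self.k1 = k1
        self.b = b
        # term -> {doc_id: term frequency in that document}
        self._postings: dict[str, dict[str, int]] = defaultdict(dict)
        self._doc_len: dict[str, int] = {}

    def add(self, doc_id: str, text: str) -> None:
        terms = text.lower().split()
        self._doc_len[doc_id] = len(terms)
        for term, tf in Counter(terms).items():
            self._postings[term][doc_id] = tf

    def search(self, query: str) -> list[tuple[str, float]]:
        n_docs = len(self._doc_len)
        if n_docs == 0:
            return []
        avg_len = sum(self._doc_len.values()) / n_docs
        scores: dict[str, float] = defaultdict(float)
        for term in query.lower().split():
            postings = self._postings.get(term, {})
            # Rarer terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
            for doc_id, tf in postings.items():
                # Length normalization penalizes longer documents.
                norm = 1 - self.b + self.b * self._doc_len[doc_id] / avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)


index = SearchIndex()
index.add("e1", "rock concert in boston")
index.add("e2", "jazz night in boston")
index.add("e3", "rock festival rock legends")

results = index.search("rock boston")
ranked_ids = [doc_id for doc_id, _ in results]
```

Here `e1` ranks first because it matches both query terms, while `e3` outranks `e2` because it contains "rock" twice.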

Trade offs/Tech choices

1. API Gateway and Load Balancer

Choice:

We placed an API Gateway in front of a Load Balancer to control, route, and throttle traffic before it reaches microservices.

Pros:

Centralized authentication, compression, and routing.
Improves observability and monitoring (e.g., metrics, rate-limiting).
Allows smoother rollout of new services or A/B testing.

Trade-off:

Adds latency overhead from request inspection and transformation.
Requires careful versioning of APIs and tight coordination between teams.
Why: The added control and security at the gateway layer are worth the tradeoff in latency, especially when handling high-throughput user-facing traffic.

2. Caching with Redis (Event Cache, Query Cache)

Choice:

Used Redis/Memcached as a distributed cache in Event Service and Search Service to serve frequently accessed data.

Pros:

Reduces read latency from DB (~10–100x faster).
Handles read-heavy workloads more gracefully.

Trade-off:

Possible cache inconsistency (stale data) in case of write-heavy operations.
Adds complexity in cache invalidation logic.
Why: Since most user traffic involves reading events or search results, caching improves user experience without scaling database reads aggressively.
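
The read path can be sketched as a cache-aside lookup with a time-to-live. `TTLCache` below is an in-memory stand-in for Redis or Memcached, and `load_event_from_db` is a hypothetical database read; the TTL bounds how long stale data can be served.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """In-memory stand-in for a distributed cache with expiring entries."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: dict) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)

    def invalidate(self, key: str) -> None:
        self._entries.pop(key, None)


def get_event(event_id: str, cache: TTLCache,
              load_from_db: Callable[[str], dict]) -> dict:
    """Cache-aside read: try the cache, fall back to the database, then fill."""
    cached = cache.get(event_id)
    if cached is not None:
        return cached
    event = load_from_db(event_id)
    cache.set(event_id, event)
    return event


now = [0.0]  # fake clock so the example is deterministic
db_reads = []


def load_event_from_db(event_id: str) -> dict:
    db_reads.append(event_id)
    return {"id": event_id, "title": "Jazz Night"}


cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
get_event("evt-1", cache, load_event_from_db)  # miss: reads the database
get_event("evt-1", cache, load_event_from_db)  # hit: served from cache
reads_before_expiry = len(db_reads)

now[0] = 61.0  # the entry has expired
get_event("evt-1", cache, load_event_from_db)  # miss again
reads_after_expiry = len(db_reads)
```

Writes to an event should call `invalidate` (or overwrite the entry) so readers do not wait for the TTL to see the change.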

3. Asynchronous Processing with Queues (Virtual Queue, Retry Queue)

Choice:

We introduced queues (e.g., Kafka/RabbitMQ) for event ingestion and payment retries.

Pros:

Decouples slow processes from the user experience.
Handles load spikes better with backpressure.
Makes the system more resilient to temporary failures.

Trade-off:

More complexity: messages must be idempotent and durable.
Delayed consistency — not ideal for time-critical flows.
Why: Queues let us scale components independently, especially write-intensive ones like event insertion and payments.

4. Use of Distributed Search Engine (Elasticsearch)

Choice:

Used Elasticsearch (or OpenSearch) for Search Service to enable fast, ranked, and fuzzy search queries.

Pros:

Inverted indexing offers blazing fast search performance.
Supports relevance scoring, autocomplete, and filtering.

Trade-off:

Data duplication — events need to be stored in both main DB and search index.
Requires managing index sync and background workers.
Why: The benefits of personalized and ranked search results outweigh the cost of managing another data pipeline.

5. Separate Read & Write APIs (Ap1 / Ap2)

Choice:

Split API responsibilities into Read (Ap1) and Write (Ap2) APIs.

Pros:

Independent scaling — read APIs can scale aggressively without impacting write latency.
Improves fault isolation — one failure won’t take down both.

Trade-off:

More services = more deployment/configuration overhead.
Requires client awareness of read vs write endpoints.

Failure scenarios/bottlenecks

General Bottlenecks:

Component	Bottleneck Example	Resolution Strategy
Event DB	Write contention on same `event_id`	Partitioning / DB queueing
Notification	Rate limits on email/SMS APIs	Use queues + batch notifications
Payment Service	Synchronous dependency on Payment Gateway	Move to async with status callbacks
Feed/Search	Sudden traffic from trending events	Auto-scale replicas + cache hot results

Failure Scenario Example (Booking Surge):

A viral event launches.
10K users hit API Gateway → Load Balancer → Booking Service.
Event DB experiences contention on available_seats.
Payment Gateway has timeout errors, so users retry.
Virtual Queue buffers overflow with booking tasks.
Some seat bookings fail silently → User dissatisfaction.

1. Eventual Consistency via Event-Driven Architecture

What: Replace some synchronous service-to-service calls (e.g., booking → payment → notification) with event-driven communication (e.g., Kafka, NATS, or Amazon SNS/SQS).
Why: Increases fault tolerance and decouples services.
Failure Mitigation: If a downstream service fails, messages stay in the queue and are retried independently.

2. Distributed Locking for Booking Logic

What: Introduce Redis-based distributed locks (e.g., Redlock) to manage concurrent seat booking.
Why: Avoids race conditions/overselling tickets in high concurrency environments.
Failure Mitigation: Ensures atomic operations for inventory control under peak traffic.
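
The locking semantics can be illustrated without a Redis cluster. The sketch below is not Redlock itself: it is a single-process stand-in that models the two properties a Redis-based lock relies on, namely acquire-if-absent with an expiry (as with `SET key value NX PX ttl`) and release only by the token holder. The class and key names are hypothetical.

```python
import time
import uuid
from typing import Callable, Optional


class SeatLockTable:
    """Single-process model of per-seat locks with expiry and owner tokens."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # seat key -> (owner token, expiry timestamp)
        self._locks: dict[str, tuple[str, float]] = {}

    def acquire(self, seat_key: str, ttl_seconds: float) -> Optional[str]:
        """Return an owner token, or None if the seat is currently held."""
        now = self._clock()
        held = self._locks.get(seat_key)
        if held is not None and held[1] > now:
            return None
        token = uuid.uuid4().hex
        self._locks[seat_key] = (token, now + ttl_seconds)
        return token

    def release(self, seat_key: str, token: str) -> bool:
        """Release only if the caller still owns the lock."""
        held = self._locks.get(seat_key)
        if held is None or held[0] != token:
            return False
        del self._locks[seat_key]
        return True


now = [0.0]  # fake clock so the example is deterministic
locks = SeatLockTable(clock=lambda: now[0])

alice_token = locks.acquire("event-9:seat-A12", ttl_seconds=120)
bob_token = locks.acquire("event-9:seat-A12", ttl_seconds=120)  # seat is held

now[0] = 121.0  # Alice's hold expired without payment
bob_retry_token = locks.acquire("event-9:seat-A12", ttl_seconds=120)
alice_late_release = locks.release("event-9:seat-A12", alice_token)
```

The expiry keeps an abandoned checkout from blocking a seat forever, and the token check stops a late release from freeing a seat that someone else now holds. The final booking write should still be guarded in the database, since a lock with a timeout cannot by itself rule out double booking.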

3. Chaos Engineering

What: Introduce tools like Gremlin or Chaos Monkey to simulate service crashes, latency, etc.
Why: Proactively identify system weaknesses before they hit production.
Failure Mitigation: Catch cascading failures early and build resilience.

4. Auto-Scaling + Load Prediction with ML

What: Use past event data to predict traffic surges and auto-scale resources (compute, cache, DB replicas) beforehand.
Why: Helps prepare for peak events, reduces latency and cost.
Failure Mitigation: Prevents DB/cache bottlenecks under load.

5. Geo-Redundancy & Multi-Region Deployments

What: Deploy services in multiple regions, with geo-aware DNS routing.
Why: Protects against region-wide cloud outages.
Failure Mitigation: Keeps core features up even if a region fails.

6. Decouple Notification System with Fallbacks

What: Retry on multiple providers (e.g., SendGrid → Mailgun → SES) and store failed messages for delayed processing.
Why: Improves delivery reliability.
Failure Mitigation: No lost or missed critical alerts (like ticket confirmation).
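
The fallback chain can be sketched as a loop over providers in priority order, with a dead-letter list for messages that every provider rejected. The provider functions below are hypothetical stand-ins and do not call any real SendGrid, Mailgun, or SES API.

```python
from typing import Callable, Optional

Provider = tuple[str, Callable[[str, str], None]]


def send_with_fallback(recipient: str, message: str,
                       providers: list[Provider],
                       dead_letter: list[tuple[str, str]]) -> Optional[str]:
    """Try each provider in order; return the name of the one that succeeded."""
    for name, send in providers:
        try:
            send(recipient, message)
            return name
        except Exception:
            continue  # fall through to the next provider
    # Every provider failed: keep the message for delayed processing.
    dead_letter.append((recipient, message))
    return None


def primary_down(recipient: str, message: str) -> None:
    raise ConnectionError("primary provider unavailable")


delivered: list[tuple[str, str]] = []


def secondary_ok(recipient: str, message: str) -> None:
    delivered.append((recipient, message))


failed_messages: list[tuple[str, str]] = []

used = send_with_fallback(
    "user@example.com", "Your ticket is confirmed",
    [("primary", primary_down), ("secondary", secondary_ok)],
    failed_messages,
)

unsent = send_with_fallback(
    "user@example.com", "Event reminder",
    [("primary", primary_down)],
    failed_messages,
)
```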

7. Global Rate Limiting and Throttling

What: Implement global token bucket or leaky bucket rate limiting across users, IPs, and endpoints.
Why: Prevent system overload and abuse.
Failure Mitigation: Gracefully sheds load during traffic spikes.
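
A token bucket can be sketched in a few lines. This version is per-process with an injectable clock; a global limiter would keep the same state in a shared store, keyed by user, IP, or endpoint. The rate and capacity values are illustrative.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate_per_second`."""

    def __init__(self, rate_per_second: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_second
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        elapsed = now - self._last_refill
        # Refill for the elapsed time, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last_refill = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


now = [0.0]  # fake clock so the example is deterministic
bucket = TokenBucket(rate_per_second=1, capacity=3, clock=lambda: now[0])

burst = [bucket.allow() for _ in range(4)]  # 3 tokens available, 4 requests

now[0] = 2.0  # two seconds later, two tokens have been refilled
after_wait = [bucket.allow() for _ in range(3)]
```

Rejected requests would typically receive an HTTP 429 response so clients know to back off.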

8. Dedicated Write Service with Idempotency

What: Use a write-ahead service to queue and validate write requests, ensuring deduplication via idempotency keys.
Why: Prevents data duplication, especially under retries.
Failure Mitigation: Avoids issues like double billing or duplicate bookings.

9. Monitoring, Alerting & Dashboards

What: Use tools like Prometheus + Grafana, ELK stack, and Sentry for deep observability.
Why: Detect and resolve issues faster.
Failure Mitigation: Prevents silent failures; alerts you before users complain.