Design Ticketmaster - System Design

System requirements

Functional:

There are two kinds of users: end users and sellers. End users may reach the service through a browser or a mobile app.

End user side:

User browses events
User searches for an event
User selects an event
Show user available dates and times
User selects date and time
Show user available seats
User selects seats
Show user payment page
User makes purchase
Send user purchase confirmation via text and/or email
Send user e-tickets
Send user reminders of their purchased tickets a few days before the event starts

There are also functions on the ticket seller side:

Create an event
Dates and times, locations
Tickets and pricing
Start and end sale dates/times

Ticketmaster revenue:

Takes a small portion of sales revenue from sellers
Seller VIP subscriptions that provide advanced services

We'll focus on the architecture of the end user side in this design.

Non-Functional:

Service should be highly available. Tickets and payments should be strictly consistent. As a tradeoff, we don't want users to purchase tickets that are then lost or double sold, to pay without receiving tickets, or to receive tickets after a failed payment. Therefore, strict consistency on sold tickets and payments takes precedence over service availability.

Capacity estimation

Say 5 million daily active users (DAU) visit. 10% make a purchase. On average, each user visits 5 pages.

Read QPS = 5 million * 5 / 86400 ≈ 290

Write QPS = 5 million * 10% / 86400 ≈ 6

Peak traffic is typically 2-5x. Choose 3x.

Peak R/W QPS ≈ 870 / 18
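
The arithmetic above can be reproduced with a short sketch. It counts one read per page view and one write per purchase (10% of daily users); the function and parameter names are illustrative only.

```python
SECONDS_PER_DAY = 86_400


def estimate_qps(dau, pages_per_user, purchase_rate, peak_factor):
    """Return average and peak read/write QPS for the given assumptions."""
    read_qps = dau * pages_per_user / SECONDS_PER_DAY   # one read per page view
    write_qps = dau * purchase_rate / SECONDS_PER_DAY   # one write per purchase
    return {
        "read_avg": read_qps,
        "write_avg": write_qps,
        "read_peak": read_qps * peak_factor,
        "write_peak": write_qps * peak_factor,
    }


estimate = estimate_qps(
    dau=5_000_000, pages_per_user=5, purchase_rate=0.10, peak_factor=3
)
```

Before rounding, this gives roughly 289 reads and 5.8 writes per second on average. A real purchase may touch several rows (order, tickets, event counts), so the write figure is a lower bound on database writes.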

API design

View/Browse:

GET /events: show most popular events based on user geo locations.
GET /events/<event_id>/dts: show a specific event's available dates and times
GET /events/<event_id>/dts/<dt_id>: show tickets of a specific date and time

Purchase:

POST /events/<event_id>/dts/<dt_id>
Payload: ticket ids, price of each ticket, user payment method (encoded and/or encrypted)
Response: ack or nack. If ack, includes the order confirmation id
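
As a rough illustration of the purchase call, the sketch below models the request and response bodies as JSON. The field names (ticket_ids, prices, payment_token, ack, order_confirmation_id) are assumptions made for this example, not a fixed contract.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class PurchaseRequest:
    ticket_ids: List[str]
    prices: Dict[str, int]   # ticket id -> price the client was shown (assumed cents)
    payment_token: str       # opaque, encrypted payment method (hypothetical field)


@dataclass
class PurchaseResponse:
    ack: bool
    order_confirmation_id: Optional[str] = None   # only set when ack is True


def purchase_path(event_id, dt_id):
    """Build the POST path from the API design above."""
    return f"/events/{event_id}/dts/{dt_id}"


def encode_request(request):
    """Serialize a purchase request to a JSON body."""
    return json.dumps(asdict(request))


def decode_response(body):
    """Parse a JSON response body into a PurchaseResponse."""
    return PurchaseResponse(**json.loads(body))
```

Sending the price each ticket was displayed at lets the server reject the purchase if pricing changed between browsing and checkout.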

Database design

Event:

event_id: one unique id per event

event_unique_id: one unique id per event per location per datetime

datetime: date and time of this event_unique_id

location_id: location of this event_unique_id

ticket_tiers: pricing and number of available tickets by tier

Order:

order_id

user_id

event_unique_id

ticket_ids

total_payment

is_canceled

is_refunded

Ticket:

ticket_id

order_id

event_id

event_unique_id

user_id

price

created_at

last_updated_at

is_used

is_canceled

is_refunded

Location:

location_id

description

address

seating_map

User profile:

user_id

last_name

first_name

middle_name

address

phone

payment_method: this needs to be hashed/encrypted

unused_tickets: we can keep a copy of unused tickets here for fast retrieval when user visits their profile
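
To make the core records concrete, here is a minimal sketch of the Event, Order and Ticket shapes as Python dataclasses. The types, the TicketTier helper and the choice of integer cents for money are assumptions; Location and User profile would follow the same pattern.

```python
import datetime as dt
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TicketTier:
    price: int       # assumption: money stored as integer cents
    available: int   # tickets still on sale in this tier


@dataclass
class Event:
    event_id: str
    event_unique_id: str          # unique per event, location and datetime
    datetime: dt.datetime
    location_id: str
    ticket_tiers: Dict[str, TicketTier] = field(default_factory=dict)


@dataclass
class Order:
    order_id: str
    user_id: str
    event_unique_id: str
    ticket_ids: List[str]
    total_payment: int            # cents (assumption)
    is_canceled: bool = False
    is_refunded: bool = False


@dataclass
class Ticket:
    ticket_id: str
    order_id: str
    event_id: str
    event_unique_id: str
    user_id: str
    price: int                    # cents (assumption)
    created_at: dt.datetime
    last_updated_at: dt.datetime
    is_used: bool = False
    is_canceled: bool = False
    is_refunded: bool = False
```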

High-level design

See high level diagram

Request flows

View or Purchase Flow:

As described in the functional requirements section. Say the user has picked a specific event, date and time:

server sends back the seating map, available tickets and pricing
user selects tickets and goes to payment, which sends a request to the server
server holds the tickets for the user
user sends payment method to the server
server forwards the payment method to the payment service provider for verification
server gets a verified ack, creates an order, and creates tickets attached to the user and order
server deducts the sold tickets from the available tickets in the event table
server stores the new objects in the database and sends the user an order confirmation id

View Purchased Tickets

user goes to their profile or ticket view page, which sends a request to the server
server pulls tickets from the database (or cache) and responds to the user

Use Ticket

ticket scanner reads a ticket barcode or QR code and sends it to the server
the QR code may be dynamic in order to prevent fraud or theft, so the server translates the QR code back to a ticket id
server verifies the ticket is not used, canceled or refunded, and then marks it as used
server responds to the scanner that the ticket is marked used
on the user's device, the used ticket may be updated the next time the user views it
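
One possible way to realize the dynamic QR code and the scan check is a short-lived signed token. The sketch below is an assumption, not a description of any real ticketing scheme: SECRET, WINDOW_SECONDS and the token layout are made up for illustration, and the server is assumed to issue fresh tokens to the app.

```python
import hashlib
import hmac

SECRET = b"demo-secret"   # assumption: server-side signing key
WINDOW_SECONDS = 30       # assumption: how often the QR code rotates


def _sign(ticket_id, window):
    message = f"{ticket_id}.{window}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def make_token(ticket_id, now):
    """Build the string encoded in the QR code for the current time window."""
    window = int(now // WINDOW_SECONDS)
    return f"{ticket_id}.{window}.{_sign(ticket_id, window)}"


def decode_token(token, now):
    """Return the ticket id if the token is authentic and recent, else None."""
    try:
        ticket_id, window_text, signature = token.rsplit(".", 2)
        window = int(window_text)
    except ValueError:
        return None
    expected = _sign(ticket_id, window)
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    current = int(now // WINDOW_SECONDS)
    # Accept the current and previous window to tolerate small clock skew.
    return ticket_id if window in (current, current - 1) else None


def redeem(tickets, token, now):
    """tickets maps ticket id -> dict holding the is_used/is_canceled/is_refunded flags."""
    ticket_id = decode_token(token, now)
    ticket = tickets.get(ticket_id) if ticket_id is not None else None
    if ticket is None:
        return "invalid"
    if ticket["is_canceled"] or ticket["is_refunded"]:
        return "rejected"
    if ticket["is_used"]:
        return "already_used"
    ticket["is_used"] = True   # in a real store this check-and-set must be atomic
    return "ok"
```

A screenshot of the code stops working after at most two windows, and a second scan of the same ticket is reported as already used.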

Cancel order and/or ticket

user views an order or ticket and clicks cancel, which sends a request to the server
server marks the order or ticket as canceled and responds to the user

We have a batch job running in the background that periodically checks for DB updates and sends out texts/emails for order confirmations, newly purchased tickets, and reminders.

Reminders are scheduled (created) when the batch job first sees new updates, and are sent when the batch job sees that their reminder time has been reached.
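
The two phases of the reminder job (create on first sight, send when due) could look roughly like this. The 3-day lead time and the dictionary layout are assumptions for the sketch; actual delivery of the text or email is left out.

```python
import datetime as dt

REMINDER_LEAD = dt.timedelta(days=3)   # assumption: "a few days" before the event


def schedule_reminders(new_tickets, event_start_times, reminders):
    """First pass: create one reminder per newly seen ticket."""
    for ticket in new_tickets:
        start = event_start_times[ticket["event_unique_id"]]
        # setdefault keeps an existing reminder if the batch sees the ticket again.
        reminders.setdefault(ticket["ticket_id"], {
            "user_id": ticket["user_id"],
            "send_at": start - REMINDER_LEAD,
            "sent": False,
        })


def due_reminders(reminders, now):
    """Later passes: return ticket ids whose reminder is due, marking them sent."""
    due = []
    for ticket_id, reminder in reminders.items():
        if not reminder["sent"] and reminder["send_at"] <= now:
            reminder["sent"] = True
            due.append(ticket_id)
    return due
```

Marking a reminder as sent inside the same pass keeps a later batch run from sending it twice.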

Detailed component design

Servers:

At peak hours, many users may be looking at the same event.

To reduce latency, we need to cache the most recent popular events, typically with an LRU or LFU eviction policy. The cache includes the event profiles and their available tickets.
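
For reference, an LRU policy can be sketched in a few lines on top of Python's OrderedDict. This is a single-process illustration only; a production setup would more likely rely on an existing cache service.

```python
from collections import OrderedDict


class LRUCache:
    """Keeps at most `capacity` entries, evicting the least recently used one."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)          # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # drop the least recently used entry
```

An event that keeps being viewed stays at the fresh end of the dictionary, so only events nobody has opened recently get evicted.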

To ensure availability and support disaster recovery, we need replicas. The leader should propagate updates to followers.

Server should immediately deduct a ticket from the available pool in its cache/memory as soon as it receives a user purchase intent (i.e. the user goes to the payment page but has not yet submitted payment), and hold it for a period of time, say 10 minutes. A payment submitted after the hold expires will fail, because the tickets have been released back to the available pool for other users to purchase. Without the expiration, abandoned holds would keep other users from ever purchasing those tickets.

Thus for payment, the server needs to check that the tickets are still held for this user before sending payment info to the third-party payment service provider for verification.
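
The hold-and-expire behaviour can be sketched as a small in-memory structure. The class name, method names and the explicit `now` argument (seconds) are assumptions for illustration; a real deployment would keep this state in a shared store so every server sees the same holds.

```python
HOLD_SECONDS = 600   # the 10-minute hold from the text


class TicketHolds:
    def __init__(self, ticket_ids):
        self.available = set(ticket_ids)
        self.holds = {}      # ticket id -> (user id, expiry time)
        self.sold = set()

    def _release_expired(self, now):
        for ticket_id, (_, expires_at) in list(self.holds.items()):
            if expires_at <= now:
                del self.holds[ticket_id]
                self.available.add(ticket_id)

    def hold(self, user_id, ticket_ids, now):
        """Called on purchase intent; succeeds only if every ticket is free."""
        self._release_expired(now)
        wanted = set(ticket_ids)
        if not wanted <= self.available:
            return False
        for ticket_id in wanted:
            self.available.remove(ticket_id)
            self.holds[ticket_id] = (user_id, now + HOLD_SECONDS)
        return True

    def can_pay(self, user_id, ticket_ids, now):
        """Check made before payment info goes to the payment provider."""
        self._release_expired(now)
        return all(
            ticket_id in self.holds and self.holds[ticket_id][0] == user_id
            for ticket_id in ticket_ids
        )

    def confirm(self, user_id, ticket_ids, now):
        """Called after the provider approves; turns holds into sold tickets."""
        if not self.can_pay(user_id, ticket_ids, now):
            return False
        for ticket_id in set(ticket_ids):
            del self.holds[ticket_id]
            self.sold.add(ticket_id)
        return True
```

A hold could still expire while the payment provider is processing, so the sketch checks again in confirm. A real system would also need to refund or extend the hold in that case.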

We choose consistency over latency or availability here, as financial transactions are involved. For strong/strict consistency, we want the payment info, order and tickets to be updated in all replicas before we confirm to the user that their purchase succeeded.

To handle peak hours, we also want to route users to different servers. This may be based on consistent hashing of their user ids or auth tokens, which spreads traffic evenly while also providing a consistent user experience and better caching, as the same user's traffic is routed to the same server.
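
A minimal consistent-hashing ring might look like the following. The virtual node count and the use of SHA-256 are arbitrary choices for this sketch.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a point on the ring (first 8 bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, servers, vnodes=100):
        # Each server is placed at `vnodes` points to even out the load.
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def server_for(self, user_id):
        """Walk clockwise from the user's hash to the next server point."""
        index = bisect.bisect_right(self._points, _hash(user_id)) % len(self._ring)
        return self._ring[index][1]
```

The same user id always maps to the same server, and removing a server only remaps the users that were assigned to it, which keeps most per-server caches warm.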

Database:

As the data are not highly normalized and write speed is critical during peak hours, we lean toward an append-only DB or key-value store for performance.

A few options for DB partitioning:

By user ids:
Pros: low latency for users retrieving their tickets and orders
Cons: higher latency for servers to gather all info for an ongoing sales event
By event ids:
Pros: fast updates when there are any ticket changes on an event. Easy to scale out.
Cons: viewing and using tickets may be slower, especially around the start time of a popular event when everyone is pulling up their tickets
By order ids or ticket ids:
Pros: independent of users and events, avoids the hotspot problem
Cons: not much gain in user experience or latency
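
The effect of the partition key on query fan-out can be seen in a small sketch. It assumes simple hash-mod sharding and hypothetical field names; it only counts which shards a lookup would have to touch.

```python
import hashlib


def shard_for(key, num_shards):
    """Assign a key to a shard by hashing it (assumed hash-mod scheme)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shards_touched(tickets, partition_field, filter_field, filter_value, num_shards):
    """Shards that hold tickets matching filter_field == filter_value."""
    return {
        shard_for(ticket[partition_field], num_shards)
        for ticket in tickets
        if ticket[filter_field] == filter_value
    }
```

With tickets partitioned by event id, all tickets of one event sit on a single shard (fast event updates, but a hot shard for a popular event). With tickets partitioned by user id, the same event's tickets are spread across many shards while each user's tickets stay together.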

Mitigations: with any of the options, we could cache popular events (typically those that recently went on sale, or whose event time is approaching) to reduce read latency. Note that writes (i.e. purchases) should still choose consistency over latency.

Regardless of which partition method we use, we should always keep replicas of data, as we certainly don't want to lose anything, especially when financial transactions are involved. With 3 replicas that each have a 99.9% availability SLA and fail independently, all three are down at once only 0.001^3 = 10^-9 of the time, an average of less than 32 milliseconds per year.
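
The downtime figure can be checked with a short calculation. It assumes replica failures are independent, which is optimistic when replicas share a datacenter or a deployment pipeline. The second function is an added comparison rather than part of the original estimate.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def all_replicas_down_seconds(availability, replicas):
    """Expected seconds per year during which every replica is down at once."""
    return (1 - availability) ** replicas * SECONDS_PER_YEAR


def all_replicas_up_fraction(availability, replicas):
    """Fraction of time every replica is up (needed if a write must reach all of them)."""
    return availability ** replicas


downtime_ms = all_replicas_down_seconds(0.999, 3) * 1000   # about 31.5 ms
all_up = all_replicas_up_fraction(0.999, 3)                # about 0.997
```

The 32 ms figure describes how often no copy of the data is reachable. If every write had to be acknowledged by all three replicas, write availability would instead be about 99.7%, which is one reason to consider a quorum rather than waiting for every replica.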

To further reduce the possibility of data loss, every DB write could first be appended to a write-ahead log.
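
As a toy illustration of the write-ahead idea, the sketch below appends each change to a log file and forces it to disk before applying it in memory, then replays the log on startup. The JSON-lines format and the class name are assumptions; real databases implement this internally.

```python
import json
import os


class WalStore:
    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()
        self._log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        """Rebuild in-memory state from the log after a restart."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    break   # torn final record from a crash mid-write
                self.data[record["key"]] = record["value"]

    def put(self, key, value):
        # 1. Append to the log and force it to disk.
        self._log.write(json.dumps({"key": key, "value": value}) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        # 2. Only then apply the change to the live state.
        self.data[key] = value

    def close(self):
        self._log.close()
```

If the process dies between the two steps, the record is already durable and is applied again on the next start.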

Trade offs/Tech choices

We have chosen consistency over availability or latency on writes, because financial transactions are involved. To recover some availability and latency, instead of waiting for all replicas to confirm before we commit a change, we could use quorum-based replication and version vectors. By choosing W + R > N (number of replicas that must acknowledge a write + number of replicas consulted on a read > total number of replicas), every read set overlaps every write set, so at least one node consulted on a read has the up-to-date data. With version vectors, we'll be able to tell which node has the latest data. This gives lower latency on ticket purchases, but adds system complexity.
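
The W + R > N rule can be checked by brute force for small clusters. The sketch below also shows a quorum read picking the freshest copy; it uses a single integer version per replica as a simplification of version vectors.

```python
from itertools import combinations


def quorums_always_overlap(n, w, r):
    """True if every possible write set of size w shares a node with every read set of size r."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


def quorum_read(replicas, read_set):
    """replicas is a list of (version, value); return the newest value in the read set."""
    newest = max((replicas[i] for i in read_set), key=lambda pair: pair[0])
    return newest[1]
```

With N = 3, choosing W = 2 and R = 2 always overlaps, while W = 2 and R = 1 can miss the latest write because the single replica read may be the one that was not updated.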

We may consider using a CDN to cache popular events in different geographic regions, and route user read requests to their closest CDN server first. This is particularly useful when a popular event's ticket sale first goes live on the platform.

Failure scenarios/bottlenecks

Most of the latency comes from the purchase step. However, as it's a financial transaction, we think that's acceptable. In the real world, an event like Amazon Prime Day has millions to billions of transactions, and people may experience intermittent service interruptions at peak time or for popular items; similarly, during a massive selloff on a stock exchange, retail users may not be able to send out orders.

There is a very small chance that all replicas are unavailable at once (< 32 ms per year from the prior calculation). The main scenario of concern in our design is when a charge is made on the user's payment method but all servers or databases go down, so that the user does not receive confirmation and tickets. In this case, we'll involve the payment service provider, reconcile their transaction logs against ours, find the gap, and make fixes.

Future improvements

Due to limited time, we didn't cover seller platform details.

On the user side, we may consider adding reviews and comments for events, letting users decide when they want to receive reminder notifications, etc.

We should also prevent attackers from abusing our system, e.g., holding lots of event tickets by creating many purchase intents without actually purchasing, or sending lots of invalid payment info.