My Solution for Design Ticketmaster with Score: 9/10

by utopia4715

System requirements


Functional:

User is able to:

  1. Search for a theater by city or ZIP code.
  2. After selecting a theater, search for movies shown the the theater.
  3. After selecting a movie, select which show (e.g. show starting at 1PM, 3PM, 5PM, ...)
  4. After selecting a show, user is presented a list of available seats.
  5. User can select the seats and finalize the booking.



Non-Functional:

I'd like to focus on the following aspects, as they are the most important:

  1. Consistency - once a booking is made, it has to be respected. No double bookings.
  2. Concurrency - multiple users are interacting with the service to view seats and book seats. Make sure system handles concurrent accesses and provide consistent bookings.
  3. Scalability
  4. Fault Tolerance


We can relax a little bit on response time. End user is used to booking something in a multi-minute session, so we don't have to optimize for user response time very hard (e.g. if some request takes a second or two, that would most likely be acceptable.)


There can be more additional features like movie recommendations, membership, luxury seats, etc., but I will not focus on them for now.



Capacity estimation

This service covers 1000 cities worldwide.

Each city has 4 theaters on average.

Each theater plays on average 10 movies.

Each movie has 4 shows on average.

Overall, 100M movie tickets are sold monthly.


For each booking, I will estimate it would take 256B to store, considering:

  • IDs like user, theater, movie, show, ...
  • timestamp
  • seats
  • status


(I can get to more accurate estimates as I work on data models later.)


256 * 100M = 25.6GB / month

In two years, it would require 614.4GB. Considering some user growth and open capacity, let's say we will need 1TB in the main DB.


Considering the importance of consistency and concurrency, and considering the data amount is within reasonable capacity of RDB, I will pick RDB (e.g. MySQL, PostgreSQL) as the primary data base.


API design

Let's think about a journey of a user through this service:


  1. User searches for theaters: search_theater(user_ID, city, zip, lat/longitude, size) -> returns theater ID
  2. User selects a theater and searches for movies: search_movies(user_ID, genre, rating, keyword) -> returns JSON objects including movie IDs, description and media.
  3. User selects and movie and views possible for shows: list_shows(user_ID, movie_ID, theater_ID, date) -> returns JSON objects including show ID, movie ID, theater ID, start and end time.
  4. User selects a show and views open seats: list_open_seats(user_ID, show_ID) -> returns JSON object including a list of available seats.
  5. User selects seats for bookings: select_seats(user_ID, show_ID, array of seat IDs) -> This holds the seats for the user for N (e.g. 5) minutes. Within the N minutes, user can finalize booking and pay. If user does not do this in N minutes, the seats would become available for other users. This returns a booking ID and price.
  6. User books. book(user_ID, booking_ID). -> This returns the price. This starts payment transaction with an external payment provider.
  7. After paying using an external provider, user finalizes booking: finalize_booking(user_ID, booking_ID) -> This finalizes the booking in the backend. Now the seats belong to the user.


Errors are generally communicated to clients via HTTP error code, e.g., 3XX if client sends wrong user IDs or booking IDs. 4XX if a problem occurs in the server side, e.g., rate limiting or detected conflicting requests.


An important error scenario is if the user fails to complete transaction (e.g. failure to commit, failure to pay) after selecting seats for bookings. In a way, steps (5) - (7) is a transaction. The whole transaction of (5) - (7) succeeds, or not. In case it fails (e.g. user changes their mind, or fails to pay), the transaction has to be rolled back. In the following, I will make sure of this design.


Database design


I will focus on the tables that are important for consistency and concurrency of bookings.


Booking is made on particular seats on a particular show. So let's start from them:


Shows:

  • show_ID : primary key
  • movie_ID : foreign key to Movies table


Seats:

  • seat_ID : primary key
  • theater_ID : foreign key to Theaters table


# Since show and seats have many-to-many relationship, I will create this join table.

Show_Seats:

  • show_seta_ID : primary key
  • show_ID : foreign key
  • seat_ID : foreign key
  • status : enum: held, booked, completed, expired
  • booking_ID: foreign key to Bookings table


Importantly, Show_Seats table has unique key constraint with a composite key show_ID + seat_ID. This avoids double booking.


For example, if two users try to book a same seat on a same show, one of them would fail to complete the CREATE operation.


If this happens, the UI would show an error, e.g., "Sorry, we could not complete your booking. Please select different seats and try again."


There are many other tables, like Users, Movies, Theaters, ... but I will only list the ones that are critical in achieving consistency and concurrency.


To achieve consistency and concurrency, there are a couple of different approaches:

  1. Rely on RDB's native capabilities, e.g., unique key constraints, pessimistic or optimistic locking.
  2. Develop a stored procedure in RDB.
  3. Rely on applications ability to provide synchronization, e.g., by distributed locks.


For this application, I will explore (1) first.

(2) is powerful, but developing, debugging and maintaining a tricky stored procedure will be costly and error-prone. (1) is a more well understood way.


(3) is possible, but it will be more error-prone during the life of application development. One buggy software release and it might corrupt the database. (1) is a safer approach in terms of protecting the data.


High-level design

At a high level:


Client (either browser or mobile app) first interacts with CDN. CDN caches static contents (e.g. movie trailers, theater images). It also protects the backend systems from DoS, terminates TLS, and so on.


API Gateway distributes requests multiple servers providing multiple services. In this picture, I only drew Booking Service. But in a real system, there would be other services, e.g., services for showing trailers, making recommendations, etc. It also acts as a load balancer.


I decided against breaking Booking Service further. All the APIs I discussed in this exercise are similar in the sense that they all write to common set of RDB tables. I think it makes sense to keep this in one service.


As I mentioned, RDB should be the main data store for this service.


I also introduced Redis Cache to boost performance.


All the objects presented in the RDB tables, e.g., Theater, Movie, Show, User - these are accessed by the Booking Service in read-heavy fashion. Therefore, it'd be a performance boost to cache them. As consistency is the key, we should use a caching policy that provides strong consistency between the cache and the DB, e.g., write through policy.



Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...






Detailed component design

Let's talk about database and concurrency. These three operations must be done in a transactional way.


  1. User selects seats for bookings: select_seats(user_ID, show_ID, array of seat IDs)
  2. User books. book(user_ID, booking_ID).
  3. After paying using an external provider, user finalizes booking: finalize_booking(user_ID, booking_ID)


In other words, if anything fails within (1) - (3), the transaction must be rolled back as if nothing happened.


We also have to do this with concurrency in mind. When multiple users try to book the same seats on a same show, the system must work gracefully.


When select_seats() is called, we need to create rows in Show_Seats table to signify that the seats for this show are held for N minutes. We can express this by setting the rows' status to "held", with a timestamp.


We will use uniqueness constraints on Show_ID and Seat_ID on this table to provide consistency.


insert into Show_Seats table (Show_ID, Seat_ID, status, timestamp)

values (100, 10, held, 4-22-2024 12:00:00),

(100, 11, held, 4-22-2024 12:00:00),

(100, 12, held, 4-22-2024 12:00:00)


This operation would fail and roll back, for example, if another client successfully booked seats 12 on the same show. The important thing is that this insert statement inserts all the seats in the booking at once. This way, if the operation fails, it would be rolled back for all the seats.


Going to booking stage, we need to update the status to "booked".


begin transaction

select status, timestamp from Show_Seats where Show_ID= 100 and Seat_ID in (10, 11, 12)

for update


update Show_Seats set status='booked', timestamp = '4-22-2024 12:00:02' where Show_ID=100 and Seat_ID in (10, 11, 12)

end transaction


This is pessimistic locking. I assume performance would be acceptable because it will only lock these rows. At the booking or finalization stage, only one client would be accessing these rows, so I think the scalability aspect would be OK.


If performance measurement reveals we need to improve scalability more, we can use optimistic locking. In which, a client would go ahead and write to these rows without locking them. If write conflict occurred (which you can tell by inconsistent version number on the rows), you would rollback the write operation and retry.


Optimistic locking would increase write performance.

Instead of taking locks before writing, Booking Service process would go ahead and write to tables. Write conflicts (due to concurrent writes) can be discovered by checking the version number of the objects to be written. If the write conflict is rare, it would be a performance boost over pessimistic locking.


However, optimistic locking would make the application logic more complex. When write fails due to conflict, it would be the application's responsibility to resolve the conflict.


As such, I would first prototype pessimistic locking approach, and measure performance. If it is not performant enough, I would invest in optimistic locking or different approaches such as stored procedures, or implementing synchronization in the application code (Booking Service).


We may choose to recommend certain seats before user selects their seats. For example, greedy algorithm can be implemented to choose seats in a simple policy, e.g., consecutive seats as close as possible to the front row.


Redis Cache serves hash map like data structure for objects such as Users, Theaters, and Movies.


Trade offs/Tech choices





Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.






Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?