System requirements


Functional:

User is able to:

  1. Search for a theater by city or ZIP code.
  2. After selecting a theater, search for movies shown the the theater.
  3. After selecting a movie, select which show (e.g. show starting at 1PM, 3PM, 5PM, ...)
  4. After selecting a show, user is presented a list of available seats.
  5. User can select the seats and finalize the booking.


[Generally speaking, you would like to keep the requirements scope small. You only have 35 - 50 min in an interview. If you have a lot of requirements, you'd risk running out of time. We could have other requirements, e.g., selecting a movie first then a theater, movie recommendations, membership, luxury seats ... But the central theme of this problem is how to achieve both transactions and scalability. You'd like to focus on that.]


Non-Functional:


We focus on the following aspects, as they are the most important:

  1. Consistency - once a booking is made, it has to be respected. No double bookings.
  2. Concurrency - multiple users are interacting with the service to view seats and book seats. Make sure system handles concurrent accesses and provide consistent bookings.
  3. Scalability
  4. Fault Tolerance


We can relax a little bit on response time. End users are used to booking something taking substantial amount of time. If some requests take a few seconds, that would most likely be acceptable.


[A common mistake is to write non-functional requirements and do nothing about it. A better approach is to focus on non-func requirements that impact the design. For example, in this problem, consistency, concurrency, and relaxed response time requirement lead us to selecting relational DB.]


Capacity estimation

This service covers 1000 cities worldwide.

Each city has 4 theaters on average.

Each theater plays on average 10 movies.

Each movie has 4 shows on average.

Overall, 100M movie tickets are sold monthly.


For each booking, I will estimate it would take 256B to store, considering:

  • IDs like user, theater, movie, show, ...
  • timestamp
  • seats
  • status


[It's a delicate balance on how deeply you go on data models during estimation. It is important to have a reasonable number, but it is also important not to spend too much time. We haven't even selected a database, so data model may change a lot later. Just some reasonable number you can defend, like above, would do.]


256 * 100M = 25.6GB / month

In two years, it would require 614.4GB. Considering some user growth and open capacity, let's say we will need 1TB in the main DB.


Considering the importance of consistency and concurrency, and considering the data amount is within reasonable capacity of RDB, we will pick RDB (e.g. MySQL, PostgreSQL) as the primary data base.


API design

Let's think about a journey of a user through this service:


  1. User searches for theaters: search_theater(user_ID, city, zip, lat/longitude, size) -> returns theater ID
  2. User selects a theater and searches for movies: search_movies(user_ID, genre, rating, keyword) -> returns JSON objects including movie IDs, description and media URIs for previews.
  3. User selects a movie and receives a list of possible shows for the movie: list_shows(user_ID, movie_ID, theater_ID, date) -> returns JSON objects including show ID, movie ID, theater ID, start and end time.
  4. User selects a show and views open seats: list_open_seats(user_ID, show_ID) -> returns JSON object including a list of available seats.
  5. User selects seats for bookings: select_seats(user_ID, show_ID, array of seat IDs) -> This holds the seats for the user for N (e.g. 5) minutes. Within the N minutes, user can finalize booking and pay. If user does not do this in N minutes, the seats would become available for other users. This returns a booking ID and price.
  6. User books the seats. book(user_ID, booking_ID). -> This returns the price. This starts a payment transaction with an external payment provider.
  7. After paying using an external provider, user finalizes booking:

finalize_booking(user_ID, booking_ID) -> This finalizes the booking in the backend. Now the seats belong to the user.


[We discuss API by walking through the steps in this question because the flow is complicated. It helps interviewer & candidate to get on the same page. For other questions, let's say TinyURL, the flows are so simple you can just list APIs.]


Errors are communicated to clients via HTTP error code, e.g., 3XX if client sends wrong user IDs or booking IDs. 4XX if a problem occurs in the server side, e.g., rate limiting or detected conflicting requests.


An important error scenario is if the user fails to complete a transaction (e.g. failure to commit, failure to pay) after selecting seats for bookings.


Steps (5) - (7) represent a transaction. The steps (5) - (7) should succeed completely, or fail not. In case it fails (e.g. user changes their mind, or does not pay on time), the transaction has to be rolled back. In the following, I will make sure of this design.


[Failure is also a very good thing to (briefly) consider at this point. It prepares you for a fault tolerance discussion in the deep dive.]


Database design

[Senior-level deep dive topic. When you choose RDB as a data store, it would usually present a good deep dive opportunity. RDBs have wonderful capabilities (consistency, concurrency), but they require careful design on data models and scalability. It is a good opportunity to showcase your knowledge. Reference: https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/]


We will focus on the tables that are important for consistency and concurrency of bookings.


Booking is important because this is the final outcome the booking system. Let's start from Bookings and expand to other important tables:


Bookings:

  • booking_ID
  • user_ID
  • created_time
  • status : enum: held, booked, completed, expired


Shows:

  • show_ID : primary key
  • movie_ID : foreign key to Movies table


Seats:

  • seat_ID : primary key
  • theater_ID : foreign key to Theaters table


# Since show and seats have many-to-many relationship, We will create this join table.


Show_Seats:

  • show_seta_ID : primary key
  • show_ID : foreign key
  • seat_ID : foreign key
  • booking_ID: foreign key to Bookings table


Importantly, Show_Seats table has a unique key constraint with a composite key show_ID + seat_ID. This avoids double booking.


For example, if two users try to book a same seat on a same show, one of them would fail to complete the CREATE operation.


If this happens, the UI would show an error, e.g., "Sorry, we could not complete your booking. Please select different seats and try again."


There are many other tables, like Users, Movies, Theaters. For the benefit of time, we will not go to their details.


[There is a temptation to list all the possible tables, define all data types and indices, and draw beautiful diagrams. But that would take up all the time for the interview. Only talk about tables that are important to your whole story.]


There are a couple of different approaches to achieve consistency and concurrency:


  1. Rely on RDB's native capabilities, e.g., unique key constraints, pessimistic or optimistic locking.
  2. Develop a stored procedure in RDB.
  3. Rely on applications ability to provide synchronization, e.g., by distributed locks.


For this application, we will explore (1) first.


(2) is powerful. But developing, debugging and maintaining a tricky stored procedure will be costly and error-prone. (1) is a more well understood way.


(3) is possible, but it will be more error-prone during the life of application development. One buggy software release and it might corrupt the database. (1) is a safer approach in terms of protecting the data.


High-level design

At a high level:


Client (either browser or mobile app) first interacts with CDN. CDN caches static contents (e.g. movie trailers, theater images). It also protects the backend systems from DoS, terminates TLS, and so on.


API Gateway distributes requests multiple servers providing multiple services. In this picture, we only drew Booking Service. But in a real system, there would be other services, e.g., services for showing trailers, making recommendations, etc. It also acts as a load balancer.


We decided against breaking Booking Service further. All the APIs we discussed in this exercise are similar in the sense that they all write to common set of RDB tables. We think it makes sense to keep this in one service.


As we mentioned, RDB should be the main data store for this service.


We also introduced Redis Cache to boost performance.


All the objects presented in the RDB tables, e.g., Theater, Movie, Show, User - these are accessed by the Booking Service in read-heavy fashion. Therefore, it'd be a performance boost to cache them. As consistency is the key, we should use a caching policy that provides strong consistency between the cache and the DB, e.g., write through policy. LRU cache eviction policy can be used to take advantage of access locality.




Request flows


All requests, such as search_theater(), search_movie, book(), are issued by the client, come through the API Gateway, and are handled by Booking Service.


Most requests would read or write to the RDB tables.





Detailed component design

[Senior-level deep dive topic.]


Let's talk about database and concurrency. These three operations must be done in a transactional way.


  1. User selects seats for bookings: select_seats(user_ID, show_ID, array of seat IDs)
  2. User books. book(user_ID, booking_ID).
  3. After paying using an external provider, user finalizes booking: finalize_booking(user_ID, booking_ID)


In other words, if anything fails within (1) - (3), the transaction must be rolled back as if nothing happened.


We also have to do this with concurrency in mind. When multiple users try to book the same seats on a same show, the system must reject one of the request, gracefully.


When select_seats() is called, we create rows in Show_Seats table to signify that the seats for this show are held for N minutes. We can express this by setting the rows' status to "held", with a timestamp.


We will use uniqueness constraints on Show_ID and Seat_ID on this table to provide consistency.


This is an example of SQL statement:


insert into Show_Seats table (Show_ID, Seat_ID, status, timestamp) values (100, 10, held, 4-22-2024 12:00:00), (100, 11, held, 4-22-2024 12:00:00), (100, 12, held, 4-22-2024 12:00:00)


This operation would fail and be rolled back, for example, if another client successfully booked seats 12 on the same show. The important thing is that this insert statement inserts all the seats in the booking at once. This way, if the operation fails, it would be rolled back for all the seats.


Going to booking stage, we need to update the status to "booked".


begin transaction select status, timestamp from Bookings where booking_ID = 200 for update update Bookings set status='booked', timestamp = '4-22-2024 12:00:02' where booking_ID = 200 end transaction



This is using pessimistic locking. We assume performance would be acceptable because it will only lock these rows. At the booking or finalization stage, only one client would be accessing these rows, so we think the scalability aspect would be OK.


If performance measurement reveals we need to improve scalability more, we can use optimistic locking. In which, a client would go ahead and write to these rows without locking them. If write conflict occurred (which you can tell by inconsistent version number on the rows), you would rollback the write operation and retry.


Optimistic locking would increase write performance.

If the write conflict is rare, it would be a performance boost over pessimistic locking.


However, optimistic locking would make the application logic more complex. When write fails due to conflict, it would be the application's responsibility to resolve the conflict.


As such, we would first implement the pessimistic locking approach, and measure performance. If it is not performant enough, we would invest in optimistic locking or different approaches such as stored procedures, or implementing synchronization in the application code (Booking Service).


We may choose to recommend certain seats before user selects their seats. For example, greedy algorithm can be implemented to choose seats in a simple policy, e.g., consecutive seats as close as possible to the front row.



Trade offs/Tech choices


See Database Design section for the tradeoff discussion of choosing the data store and how to implement consistency and concurrency.


Failure scenarios/bottlenecks


Booking Cancelation

[Mid-level deep dive topic.]


If a user fails to pay (e.g. invalid credentials on the Payment System, timeout, etc.), the booking transaction must be rolled back.


update Bookings set status='canceled', timestamp = '4-22-2024 12:00:02' where booking_ID = 200


We then need to update (or remove) associated Bookings objects from cache.


Because RDB provides strong consistency, once booking is canceled, all the clients who subsequently read from the RDB would see the correct state (i.e. booking is canceled).


Future improvements


Database Scalability

[Mid-level deep dive topic.]


Read replicas can be used to enhance read throughput, and fault tolerance of the RDB.


Read access can be handled by the replicas, freeing the primary node to focus on handling writes.


When a primary node goes does for any reason, a read replica can take over the primary (write) responsibility.


Sharding can also enhance scalability. Show_ID would be a good candidate for the sharding key because one shard can serve all the accesses to a particular show. We can employ consistent hashing to distribute load among the nodes evenly.