System requirements


Functional:

List functional requirements for the system (Ask interviewer if stuck)...


users:

customers

  • purchase movie tickets
  • create an account

admins

  • ticket inventory, pricing, promotion
  • add/remove listings, theatres, showtimes


customers create an account and log in

admins have their own special panel


features:

  • user registration and authentication
  • display a list of available movies, showtimes, theatres
  • genre, casting, ratings, trailers
  • users can pick seats in theatres
  • update seat availability in real time
  • payment processing for ticket purchase
  • support different payment methods
  • send confirmation with booking details
  • notify customers of changes or cancellations
  • customers can add movie ratings and add reviews
  • display overall rating for each movie
  • movie management - add movies, update existing movies, set showtimes
  • theatre management - add new theatres, edit existing theatres, update showtimes at each theatre


Because each set of actions falls into a category, each category will live in its own microservice. For example, payment-related actions will live in the payments service.


Non-Functional:

List non-functional requirements for the system...


non-functional requirements

  • needs to be fault tolerant for both users and admins at any given time
  • real time write updates - writes need to be fast
  • read delays are minimal - reads also need to be fast
  • low latency
  • peaks will increase reads/writes
  • scale dynamically


the system needs to scale dynamically as traffic grows, so we will need to look at things like horizontal scaling. reads need to be fast, and since there are more reads than writes, we can consider caching reads for the more popular movies and theatres. the system also needs to be highly performant and available to users and admins to provide a seamless experience.


because there can be high peak traffic, we want the load to be evenly distributed through load balancers that can themselves be scaled, so that no single server is overwhelmed and degrades the application. because we want to ensure that admins and users are routed properly, we will also need an api gateway.


to ensure high availability, the application needs to survive hardware faults and network failures, and there should be no single points of failure. we want data that is both highly available and highly consistent, but per the CAP theorem we can't have both during a network partition, so we will need to make tradeoffs. this will depend on the data model, but we can also use consistent hashing to help with load distribution and availability.


given that we will also need to host trailers, which are video media, we want to consider using a CDN to host and serve those files. since we need low latency, we will want the CDN to serve content from edge locations close to each user's locale.


in order to ensure high performance and fault tolerance, we will also need to consider logging with tools such as elasticsearch, logstash, and kibana, and monitoring with tools such as grafana.


because each action is an event, we will also want to consider an event driven architecture using apache kafka to handle decoupling of actions and services.


Capacity estimation

Estimate the scale of the system you are going to design...


user traffic - peak during weekends or holidays

traffic spikes during movie releases or special events


1,000 users concurrently during peak hours

50 admins/day


10,000 reads / hour during peak

500 writes / hour throughout the day


updates within milliseconds - real time updates
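

as a rough sanity check, the hourly estimates above can be converted into per-second rates. this is only back-of-the-envelope arithmetic on the numbers already listed; the variable names are made up for illustration.

```python
# Back-of-the-envelope rates derived from the estimates listed above.
PEAK_CONCURRENT_USERS = 1_000
PEAK_READS_PER_HOUR = 10_000
WRITES_PER_HOUR = 500
SECONDS_PER_HOUR = 3_600

reads_per_second = PEAK_READS_PER_HOUR / SECONDS_PER_HOUR
writes_per_second = WRITES_PER_HOUR / SECONDS_PER_HOUR
reads_per_user_per_hour = PEAK_READS_PER_HOUR / PEAK_CONCURRENT_USERS
read_write_ratio = PEAK_READS_PER_HOUR / WRITES_PER_HOUR

print(f"peak reads/s:        {reads_per_second:.2f}")         # 2.78
print(f"writes/s:            {writes_per_second:.2f}")        # 0.14
print(f"reads per user/hour: {reads_per_user_per_hour:.0f}")  # 10
print(f"read:write ratio:    {read_write_ratio:.0f}:1")       # 20:1
```

the 20:1 read-to-write ratio is what motivates the read caching and read replicas discussed elsewhere in these notes.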


API design

Define what APIs are expected from the system...


user creates an account - POST /v1/users/create_account

user sign-in and authentication - POST /v1/users/authenticate


retrieve movies, showtimes, theatres - GET /v1/movies/#{movie}/listings

retrieve seats available - GET /v1/theatre/#{theatre}/seats

user picks seat - POST /v1/theatre/#{theatre}/pick_seats


payments - POST /v1/payments/#{payment_type}

send booking details - POST /v1/booking_details/send_details

send change details - POST /v1/booking_details/send_change_details


user adds movie ratings - POST /v1/movies/#{movie}/add_rating

user updates movie ratings - PUT /v1/movies/#{movie}/update_rating


add movie - POST /v1/movies/add_movie

remove movie - DELETE /v1/movies/#{movie}/remove_movie

update movie - PUT /v1/movies/#{movie}/update_movie


add theatre - POST /v1/theatre/add_theatre

remove theatre - DELETE /v1/theatre/#{theatre}/remove_theatre

update theatre - PUT /v1/theatre/#{theatre}/update_theatre

set showtimes - POST /v1/theatre/#{theatre}/set_showtimes

update showtimes - PUT /v1/theatre/#{theatre}/update_showtimes


within API design, we will also need to consider concerns such as rate limiting, caching, authentication, and validation. these will be delegated to the API gateway. we have the option of building our own API gateway or purchasing an out-of-the-box, third-party solution, and there are pros and cons to each. if we have the engineering capacity, building our own api gateway lets us tailor it to the specific needs of our solution, whereas a third-party solution may be less flexible and offer fewer features but will save on engineering resources.


when rate limiting, we want to consider the read/write usage of users and how it causes traffic spikes during peak hours. since we have 1,000 concurrent users during peak hours and 10,000 reads per hour during the peak, we can assume that each user makes about 10 reads per hour at peak. because we want to stop bots from hitting us all at once or purchasing all the tickets at once, we want to rate limit gradually rather than cutting users off abruptly, which can upset them. a sliding window approach allows a burst of activity within a specific time interval. we will inform the user when they are rate limited.
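

a minimal sketch of the sliding window idea, assuming an in-memory, single-process limiter keyed by user id. the class name, the limits, and the injectable clock are all hypothetical; a real gateway would more likely keep these counters in a shared store.

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allows at most max_requests per user within any window_seconds span."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # user_id -> timestamps of accepted requests

    def allow(self, user_id):
        now = self.clock()
        hits = self._hits[user_id]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True

    def retry_after(self, user_id):
        """Seconds until the oldest request leaves the window (0.0 if not limited)."""
        now = self.clock()
        hits = self._hits[user_id]
        if len(hits) < self.max_requests:
            return 0.0
        return max(0.0, self.window_seconds - (now - hits[0]))
```

when `allow` returns False, the gateway could respond with HTTP 429 and use `retry_after` to tell the user how long to wait, which covers the "inform the user" requirement.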


in our api gateway, we will also handle authentication and security, but we will send the user's information to the user service to determine whether the user is valid, needs to create an account, or is able to log in. we can issue tokens to signify the authentication result. the api gateway will also validate all incoming requests.
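

one way the token step could look, as a rough sketch using only the python standard library. the token format, the `SECRET_KEY` value, and the function names are assumptions made for illustration; in practice an established format such as JWT and a managed secret would be the more likely choice.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical; load from a secret store


def _sign(payload):
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()


def issue_token(user_id, role, ttl_seconds=3600, now=None):
    """Called after the user service confirms the credentials are valid."""
    now = time.time() if now is None else now
    claims = {"user_id": user_id, "role": role, "exp": now + ttl_seconds}
    payload = json.dumps(claims, sort_keys=True).encode("utf-8")
    return (
        base64.urlsafe_b64encode(payload).decode("ascii")
        + "."
        + base64.urlsafe_b64encode(_sign(payload)).decode("ascii")
    )


def verify_token(token, now=None):
    """Returns the claims if the token is authentic and unexpired, else None."""
    now = time.time() if now is None else now
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:  # wrong number of parts or invalid base64
        return None
    if not hmac.compare_digest(signature, _sign(payload)):
        return None
    claims = json.loads(payload)
    if claims["exp"] <= now:
        return None
    return claims
```

the `role` claim is what would let the gateway route customers and admins differently without calling the user service on every request.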


authentication and rate limiting not only ensure safe and fair access to the platform, they also help keep the system available and resilient under high traffic. this supports the overall scalability goal.



Database design

Defining the system data model early on will clarify how data will flow among different components of the system. Also you could draw an ER diagram using the diagramming tool to enhance your design...


data model:


users

  • customers who create accounts to purchase tickets
  • admins who manage movies, theatres, and bookings

movies

  • title, genre, duration, cast
  • relationship to theatres and showtimes

theatres

  • location, seating capacity
  • relationship to movies and showtimes

bookings

  • records of ticket reservations made by users for specific movies and showtimes
  • seat selection, payment status, booking timestamps

ratings and reviews

  • user-generated reviews and ratings for movies
  • relationship between users, movies, and ratings/reviews


based on the data model, most of the relationships between the data entities are one-to-many. given these relationships, we want to use a relational database such as mysql to handle the structured data and complex relationships. if we were to build the application in a server-side framework such as ruby on rails, we could use its activerecord library to interact with the sql database.


since we are choosing mysql as our database, we want to consider consistency, availability, scalability, data integrity, replication, partitioning, multiple data centers, leader election, and load distribution (consistent hashing).


we need a highly performant database given the speed of responses needed, one that is highly available (no single points of failure, able to handle hardware or network failures) and scalable (many users during peak times).


because admins are in charge of adding and removing movie and theatre information, we don't expect any data to expire. this means there is no TTL on the data and we will be writing each event as its own record. this keeps writes fast and lets us slice and dice the data however we need to.


consistency

  • because users are purchasing tickets and expecting accurate information, the data needs to be highly consistent. we expect all users to see the same information in real time so they are able to make informed decisions. this is important as a part of user experience.
  • we also need to provide transactional support and ACID properties in order to ensure data integrity
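
to make the transactional requirement concrete, here is a small sketch of an all-or-nothing seat booking. it uses python's built-in sqlite3 only so that it runs anywhere; the notes choose mysql, and the `booked_seats` table and its columns are hypothetical (the schema in these notes does not yet model individual seats).

```python
import sqlite3


def create_schema(conn):
    # The composite primary key makes it impossible to book the same seat
    # twice for the same showtime, even under concurrent requests.
    conn.execute(
        """
        CREATE TABLE booked_seats (
            showtime_id INTEGER NOT NULL,
            seat_label  TEXT    NOT NULL,
            user_id     INTEGER NOT NULL,
            PRIMARY KEY (showtime_id, seat_label)
        )
        """
    )


def book_seats(conn, showtime_id, seat_labels, user_id):
    """Books every requested seat or none of them. Returns True on success."""
    try:
        with conn:  # commits on success, rolls back if any insert fails
            for seat in seat_labels:
                conn.execute(
                    "INSERT INTO booked_seats (showtime_id, seat_label, user_id) "
                    "VALUES (?, ?, ?)",
                    (showtime_id, seat, user_id),
                )
        return True
    except sqlite3.IntegrityError:
        return False  # at least one seat was already taken


conn = sqlite3.connect(":memory:")
create_schema(conn)
print(book_seats(conn, 1, ["A1", "A2"], user_id=10))  # True
print(book_seats(conn, 1, ["A3", "A2"], user_id=11))  # False, A2 is taken
```

because the second call fails on A2, the insert for A3 is rolled back as well, so the second user never ends up holding half of a booking.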

availability

  • the database needs to be highly available even through network issues or hardware failures
  • because there is a tradeoff between consistency and availability, we will favor consistency. this means that there can be some delay while data replicates to ensure that it is accurate

fault tolerance

  • we expect the database to keep functioning through hardware issues. this means we should have a multi-data center setup so that if one data center goes down, a backup is ready to serve data.
  • we will use a master-slave setup with leader election: there is one master, and if the master goes down, we elect a slave to act as master until the original master returns.
  • we can also place data centers close to our users' locales in order to reduce latency.
  • since we will be experiencing more reads than writes, we will also set up multiple read replicas to handle read traffic.

scalability

  • when considering scaling, we can think about sharding, multi data centers, and read replicas.
  • because we may be writing a lot of data, we will need to shard our mysql database as we scale it up. sharding is a form of horizontal scaling in which we partition the data across additional nodes to increase capacity. when we shard the database, we will need additional services: a service discovery service (such as zookeeper) that tells us which database nodes are available and healthy, a shard proxy that routes each call to the correct shard, and a primary id generator and coordinator that ensures primary ids do not collide across shards. we can use a snowflake id algorithm for the latter.
  • the shard proxy will also cache query results, monitor db health, and terminate queries
  • with additional read replicas, writes go to the master while reads can be served by either the master or the read replicas.
  • within the shard proxy, we can also add a bloom filter so we can quickly tell when a key is definitely not in the cache and skip the lookup.
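
for the primary id generator mentioned in the sharding bullet, a snowflake-style id could be sketched as below. the bit layout (timestamp, then 10 worker bits, then 12 sequence bits) follows the commonly described snowflake scheme; the custom epoch, the class name, and the injectable clock are assumptions for illustration.

```python
import threading
import time


class SnowflakeIdGenerator:
    EPOCH_MS = 1_700_000_000_000  # hypothetical custom epoch, any fixed past instant
    WORKER_BITS = 10
    SEQUENCE_BITS = 12
    MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

    def __init__(self, worker_id, clock_ms=None):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._clock_ms = clock_ms or (lambda: time.time_ns() // 1_000_000)
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = self._clock_ms()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & self.MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - self.EPOCH_MS) << (self.WORKER_BITS + self.SEQUENCE_BITS))
                | (self.worker_id << self.SEQUENCE_BITS)
                | self._sequence
            )
```

each shard (or each application node) would be given a distinct `worker_id`, so ids generated in the same millisecond on different shards still cannot collide, and ids from one generator sort by creation time.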

load distribution

  • we can use consistent hashing with our mysql database to ensure that load is evenly distributed among each node.
  • consistent hashing is a good option because when a node goes down, we won't have to rehash all the data, which limits the amount of data migration during redistribution. only the subset of data owned by that node is affected.
  • when combined with virtual nodes, consistent hashing also helps us spread data more evenly, which reduces hot and cold partitions.
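
a minimal consistent hash ring with virtual nodes, to illustrate the claims above. the class name, the `db-N` node names, and the choice of sha-256 as the hash are assumptions for the sketch.

```python
import bisect
import hashlib


class ConsistentHashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes  # virtual nodes per physical node, for smoother spread
        self._ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(h, n) for (h, n) in self._ring if n != node]

    def get_node(self, key):
        """Returns the first node clockwise from the key's position on the ring."""
        if not self._ring:
            raise LookupError("no nodes in the ring")
        idx = bisect.bisect_left(self._ring, (self._hash(key),))
        return self._ring[idx % len(self._ring)][1]


ring = ConsistentHashRing(["db-0", "db-1", "db-2"])
keys = [f"user-{i}" for i in range(1000)]
before = {k: ring.get_node(k) for k in keys}
ring.remove_node("db-1")
moved = sum(1 for k in keys if ring.get_node(k) != before[k])
owned_by_removed = sum(1 for k in keys if before[k] == "db-1")
print(moved == owned_by_removed)  # True: only keys on the removed node move
```

with plain `hash(key) % number_of_nodes`, removing a node would remap most keys; here only the keys that lived on the removed node are reassigned.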

replication

  • because we will have multiple data centers, we should use multi-leader replication to replicate between them. this will allow us to use the backup data center in case the original data center goes down
  • by waiting for replication to complete, we can ensure that there is no data loss; however, latency may increase and availability may suffer


overall, this setup helps us avoid a single point of failure which would be a bottleneck.


this database set up (using mysql) was chosen over a nosql database such as cassandra because:

  • the data is relational and should be stored in a normalized manner, whereas cassandra prefers query-based data models and denormalized data
  • cassandra is eventually consistent by default (its consistency level is tunable), but we need strong consistency within our data
  • cassandra is highly scalable: it uses consistent hashing, can easily add nodes to scale horizontally, and uses a gossip protocol for health checks. this would reduce the complexity of our system, but that alone is not enough for us to choose it.


because we will need to shard the database in order to horizontally scale to add more machines to increase capacity, we will need to consider the optimal sharding strategy while using consistent hashing.


since the data we are storing is tied to a specific user (customer or admin), we will want to use user id based sharding.


schema design:


users

primary_key | user_id | payment_id | role | last_name | first_name | hash_password | username, index: index_by_user_id


movies

primary_key | movie_id | title | genre | duration | cast, index: index_by_movie_id


movie_and_theatre

primary_key | movie_id | theatre_id


theatre_and_showtime

primary_key | theatre_id | showtime_id


showtimes

primary_key | showtime_id | showtime in utc


ratings

primary_key | rating_id | rating | review | user_id | movie_id


theatres

primary_key | theatre_id | location | seating_capacity


bookings

primary_key | booking_id | user_id | booking_status (paid, refunded) | movie_id, index: index_by_user_id, index_by_booking_id


payments

primary_key | payment_id | user_id | stripe_payment_token, index: index_by_user_id


relationships

theatre to movies, one-to-many

movie to ratings, one-to-many

user to payments, one-to-many

user to bookings, one-to-many

theatre to showtimes, one-to-many
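

as an example of how the movie-to-ratings relationship could back the "display overall rating for each movie" feature, here is a small sketch. it uses sqlite3 so it is runnable as-is (the notes target mysql), it covers only the movies and ratings tables, and the 1 to 5 rating scale and the function names are assumptions.

```python
import sqlite3


def build_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE movies (
            movie_id INTEGER PRIMARY KEY,
            title    TEXT NOT NULL
        );
        CREATE TABLE ratings (
            rating_id INTEGER PRIMARY KEY,
            rating    INTEGER NOT NULL CHECK (rating BETWEEN 1 AND 5),
            review    TEXT,
            user_id   INTEGER NOT NULL,
            movie_id  INTEGER NOT NULL REFERENCES movies(movie_id)
        );
        CREATE INDEX index_ratings_by_movie_id ON ratings(movie_id);
        """
    )
    return conn


def overall_rating(conn, movie_id):
    """Average rating for a movie, or None if it has no ratings yet."""
    row = conn.execute(
        "SELECT AVG(rating) FROM ratings WHERE movie_id = ?", (movie_id,)
    ).fetchone()
    return row[0]


conn = build_db()
conn.execute("INSERT INTO movies (movie_id, title) VALUES (1, 'example movie')")
conn.executemany(
    "INSERT INTO ratings (rating, review, user_id, movie_id) VALUES (?, ?, ?, ?)",
    [(4, "good", 10, 1), (5, "great", 11, 1)],
)
print(overall_rating(conn, 1))  # 4.5
```

since reads outnumber writes, the computed average is a natural candidate for caching and refreshing whenever a rating is added or updated.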


High-level design

You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design...


how to scale

how to achieve high throughput

how to not lose data while processing node crashes

what to do when database is slow or unavailable


scalable = partitioning

reliable = replication and checkpointing

fast = in memory, minimize disk reads


caching strategy

cdn






Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...


  1. user --> browser --> load balancer --> api gateway --> online ticketing platform --> users service queue --> check cache to determine if user token is available --> if not found, go to user service to create an account -->







Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...


load balancer

  • because we want to achieve high throughput and handle a large number of requests at scale, we can use HTTP load balancing (as we are dealing with HTTP requests) with round robin to distribute the requests. we can also run a cluster of load balancers rather than just one to avoid a single point of failure, so that traffic continues to be distributed even if one load balancer goes down
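
round robin selection could be sketched as follows. the health tracking is an addition for illustration (the notes only mention round robin), and the class and backend names are hypothetical.

```python
import itertools


class RoundRobinBalancer:
    """Cycles through backends in order, skipping any marked as down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._cursor = itertools.cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def next_backend(self):
        # Look at each backend at most once per call so we never loop forever.
        for _ in range(len(self.backends)):
            candidate = next(self._cursor)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
print([lb.next_backend() for _ in range(4)])  # ['app-1', 'app-2', 'app-3', 'app-1']
lb.mark_down("app-2")
print([lb.next_backend() for _ in range(2)])  # ['app-3', 'app-1']
```

round robin assumes requests cost roughly the same; if some requests turn out to be much heavier than others, a least-connections policy would be an alternative to consider.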


event-driven architecture + push/pull queue + checkpointing + partitioning

  • because we want to increase throughput and handle load, we want to use an event-driven architecture built on kafka to handle data processing.
  • kafka is a good choice because consumers track their committed offsets, which gives us checkpointing, and because topics let us decide which service each request is routed to
  • workflow: request passed through from api gateway --> online ticketing platform --> a cluster of kafka producers, goes to specific topic --> gets read from kafka consumer listening on specific topic --> goes to the service, service also listens in on zookeeper to determine which database to write to --> checks cache --> writes to database if not in cache --> writes to cache
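
to show what topic-based routing plus checkpointing buys us, here is a tiny in-memory stand-in. it is not the kafka client api; `TopicLog` and `Consumer` are hypothetical classes that only mimic the append-only log and committed-offset behaviour.

```python
from collections import defaultdict


class TopicLog:
    """Append-only log per topic, standing in for a message broker."""

    def __init__(self):
        self._topics = defaultdict(list)

    def publish(self, topic, event):
        self._topics[topic].append(event)
        return len(self._topics[topic]) - 1  # offset of the new event

    def read(self, topic, offset):
        return self._topics[topic][offset:]


class Consumer:
    """Processes events from one topic and checkpoints its position."""

    def __init__(self, log, topic, handler):
        self.log = log
        self.topic = topic
        self.handler = handler
        self.committed_offset = 0  # the checkpoint

    def poll(self):
        processed = 0
        for event in self.log.read(self.topic, self.committed_offset):
            self.handler(event)  # if this raises, the offset is not advanced
            self.committed_offset += 1
            processed += 1
        return processed


log = TopicLog()
log.publish("payments", {"booking_id": 1, "amount": 20})
log.publish("bookings", {"booking_id": 1, "seats": ["A1"]})

handled = []
payments_consumer = Consumer(log, "payments", handled.append)
payments_consumer.poll()
print(handled)  # [{'booking_id': 1, 'amount': 20}]
```

because the offset only advances after the handler succeeds, a consumer that crashes mid-event picks the same event up again on restart. that gives at-least-once delivery, so the services behind the consumers should be written to tolerate duplicates.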





Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...






Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.






Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?