Design Ticketmaster - System Design

System requirements

Functional:

List functional requirements for the system (Ask interviewer if stuck)...

users:

customers

purchase movie tickets
create an account

admins

ticket inventory, pricing, promotion
add/remove listings, theatres, showtimes

customer create an account and login

admins have their own special panel

features:

user registration and authentication
display a list of available movies, showtimes, theatres
genre, casting, ratings, trailers
users can pick seats in theatres
update seat availability in real time
payment processing for ticket purchase
support different payment methods
send confirmation with booking details
notify customers of changes or cancellations
customers can add movie ratings and add reviews
display overall rating for each movie
movie management - add movies, update existing movies, set showtimes
theatre management - add new theatres, edit existing theatres, update showtimes at each theatre

Because each set of actions falls under a category, each category will live in its own microservice. For example, payment-related actions will live in the payments service.

Non-Functional:

List non-functional requirements for the system...

non-functional requirements

needs to be fault tolerant for both users and admins at any given time
real time write updates - writes need to be fast
read delays are minimal - reads also need to be fast
low latency
peaks will increase reads/writes
scale dynamically

the system needs to scale dynamically as traffic grows, so we will look at horizontal scaling. reads need to be fast, and since there are more reads than writes, we can cache reads for the more popular movies and theatres. the system also needs to be highly performant and available to users and admins to provide a seamless experience.

because there can be high peak traffic, we want the load to be evenly distributed through load balancers that can themselves be scaled, so that no single server is overwhelmed and degrades the application. because admins and users need to be routed properly, we will also need an api gateway.

to ensure high availability, the application needs to survive hardware faults and network failures, and there should be no single points of failure. we want data that is both highly available and highly consistent, but the CAP theorem says that during a network partition we can't have both, so we will need to make tradeoffs. this will depend on the data model, but we can also use consistent hashing to help with load and availability.

given that we will also need to host trailers, which are video media, we also want to use a CDN to host and serve the files. since we need low latency, we want a CDN with multiple edge locations so that content is served from a location close to each user.

to ensure high performance and fault tolerance, we will also need logging with tools such as Elasticsearch, Logstash, and Kibana, and monitoring with tools such as Grafana.

because each action is an event, we also want to consider an event-driven architecture using Apache Kafka to decouple actions and services.

Capacity estimation

Estimate the scale of the system you are going to design...

user traffic - peak during weekends or holidays

traffic spikes during movie releases or special events

1000 users concurrently during peak hours

50 admins/day

10,000 reads / hour during peak

500 writes / hour throughout the day

updates within milliseconds - real time updates
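
to get a feel for these numbers, the hourly figures above can be converted into per-second rates. this is only back-of-the-envelope arithmetic on the estimates already listed; the variable names are made up for the example.

```python
# Back-of-the-envelope conversion of the stated hourly figures into per-second rates.
PEAK_CONCURRENT_USERS = 1_000
PEAK_READS_PER_HOUR = 10_000
WRITES_PER_HOUR = 500
SECONDS_PER_HOUR = 3_600

reads_per_second = PEAK_READS_PER_HOUR / SECONDS_PER_HOUR
writes_per_second = WRITES_PER_HOUR / SECONDS_PER_HOUR
read_write_ratio = PEAK_READS_PER_HOUR / WRITES_PER_HOUR
reads_per_user_per_hour = PEAK_READS_PER_HOUR / PEAK_CONCURRENT_USERS

print(f"reads per second at peak: {reads_per_second:.2f}")
print(f"writes per second:        {writes_per_second:.2f}")
print(f"read:write ratio:         {read_write_ratio:.0f}:1")
print(f"reads per user per hour:  {reads_per_user_per_hour:.0f}")
```

at roughly 3 reads per second and well under 1 write per second, the average load is small, so the design effort goes mostly into handling spikes and keeping seat availability consistent.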

API design

Define what APIs are expected from the system...

user creates an account - POST /v1/users/create_account

user sign-in and authentication - POST /v1/users/authenticate (POST rather than GET so that credentials travel in the request body, not the URL)

retrieve movies, showtimes, theatres - GET /v1/movies/#{movie}/listings

retrieve seats available - GET /v1/theatre/#{theatre_name}/seats

user picks seat - POST /v1/theatre/#{theatre}/pick_seats

payments - POST /v1/payments/#{payment_type}

send booking details - POST /v1/booking_details/send_details

send change details - POST /v1/booking_details/send_change_details

user adds movie ratings - POST /v1/movies/#{movie}/add_rating

user updates movie ratings - PUT /v1/movies/#{movie}/update_rating

add movie - POST /v1/movies/add_movie

remove movie - DELETE /v1/movies/#{movie}/remove_movie

update movie - PUT /v1/movies/#{movie}/update_movie

add theatre - POST /v1/theatre/add_theatre

remove theatre - DELETE /v1/theatre/#{theatre}/remove_theatre

update theatre - PUT /v1/theatre/#{theatre}/update_theatre

set showtimes - POST /v1/theatre/#{theatre}/set_showtimes

update showtimes - PUT /v1/theatre/#{theatre}/update_showtimes

within API design, we also need to consider concerns such as rate limiting, caching, authentication, and validation. these will be deferred to the API gateway. we have the option of building our own API gateway or purchasing an out-of-the-box, third-party solution, and there are pros and cons to each. if we have the engineering capacity, building our own API gateway lets us tailor it to the specific needs of our solution, whereas a third-party solution may be less flexible and offer fewer features, but is less complex to operate and saves on resources.

when rate limiting, we want to consider the read/write usage of users and how they cause traffic spikes during peak hours. since we have 1,000 concurrent users during peak hours and 10,000 reads per hour during the peak, we can assume that each user makes about 10 reads per hour at peak. because we want to stop bots from hitting us all at once or purchasing all the tickets at once, we want to limit requests gradually rather than cutting users off abruptly, which can upset them. a sliding window approach allows a burst of activity within a specific time interval while still capping the total. we will inform the user when they are rate limited (for example, with an HTTP 429 response).
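
the sliding window idea can be sketched as a small in-memory limiter. this is a minimal sketch, not the gateway's actual implementation: the class name, the per-client limit, and the window size are all assumptions, and a real gateway would keep the counters in a shared store rather than in process memory.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within any `window_seconds` span."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # client id -> timestamps of accepted requests

    def allow(self, client_id):
        now = self.clock()
        hits = self._hits[client_id]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # the caller would answer with HTTP 429
        hits.append(now)
        return True

    def retry_after(self, client_id):
        """Seconds until the oldest accepted request leaves the window."""
        hits = self._hits[client_id]
        if len(hits) < self.limit:
            return 0.0
        return max(0.0, self.window_seconds - (self.clock() - hits[0]))


# Hypothetical policy: 3 requests per 10 seconds per user.
limiter = SlidingWindowLimiter(limit=3, window_seconds=10)
print([limiter.allow("user-1") for _ in range(4)])
```

the value from `retry_after` is what we would return to the user so that they know when to try again.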

in our api gateway, we will also handle authentication and security but we will send our user information to the user service to determine if the user is valid, needs to create an account, or is able to log in. we can send tokens to signify the authentication result. the api gateway will also validate all incoming requests.
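
the token format is not specified above. as one possible sketch, the user service could issue an HMAC-signed token that the api gateway verifies on each request. the secret, the claim names (`sub`, `role`, `exp`), and both function names are assumptions made for this example.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET_KEY = b"demo-secret-change-me"  # hypothetical; a real key would come from a secret store


def issue_token(user_id, role, ttl_seconds=3600, now=None):
    """Return 'payload.signature', both URL-safe base64 encoded."""
    now = time.time() if now is None else now
    payload = json.dumps({"sub": user_id, "role": role, "exp": now + ttl_seconds}).encode()
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode() + "."
            + base64.urlsafe_b64encode(signature).decode())


def verify_token(token, now=None):
    """Return the claims dict if the token is authentic and unexpired, else None."""
    now = time.time() if now is None else now
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:  # wrong number of parts or invalid base64
        return None
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None  # payload was tampered with or signed with another key
    claims = json.loads(payload)
    return claims if claims["exp"] > now else None


token = issue_token("user-42", "customer")
print(verify_token(token))
```

the `role` claim is what would let the gateway route admins to the admin panel and customers to the customer-facing services.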

by having authentication and rate limiting, we not only ensure safe and fair access to the platform, we also help keep the system highly available and tolerant of high traffic. this helps with the overall scalability goal.

Database design

Defining the system data model early on will clarify how data will flow among different components of the system. Also you could draw an ER diagram using the diagramming tool to enhance your design...

data model:

users

customers who create accounts to purchase tickets
admins who manage movies, theatres, and bookings

movies

title, genre, duration, cast
relationship to theatres and showtimes

theatres

location, seating capacity
relationship to movies and showtimes

bookings

records of ticket reservations made by users for specific movies and showtimes
seat selection, payment status, booking timestamps

ratings and reviews

user-generated reviews and ratings for movies
relationship between users, movies, and ratings/reviews

based on the data model, most of the relationships between entities are one-to-many (the exception is movies and theatres: a movie can play in many theatres and a theatre shows many movies, which needs a join table). given these relationships, we want to use a relational database such as MySQL to handle the structured data and complex relationships. if we were to build the application in a server-side framework such as Ruby on Rails, we could use its Active Record library to interact with the SQL database.

since we are choosing MySQL as our primary database, we want to consider consistency, availability, scalability, data integrity, replication, partitioning, multi-data center setup, leader election, and load distribution (consistent hashing).

we need a highly performant database given the speed of responses needed, one that is highly available (no single points of failure, able to handle hardware or network failures) and scalable (many users during peak times).

because admins are in charge of adding and removing movie and theatre information, we don't expect any data to expire on its own, so the TTL is effectively infinite. we will write each event as its own record, which keeps writes fast (they are simple inserts) and lets us slice and dice the data however we need to.

consistency

because users are purchasing tickets and expecting accurate information, the data needs to be highly consistent. we expect all users to see the same information in real time so they are able to make informed decisions. this is important as a part of user experience.
we also need to provide transactional support and ACID properties in order to ensure data integrity

availability

the database needs to be highly available even through network issues or hardware failures
because this is a tradeoff between consistency and availability, we will focus more on consistency. this means that there can be some delay due to data replication to ensure that data is accurate

fault tolerance

we expect the database to keep functioning even when we are having hardware issues. this means we should have a multi-data center setup so that if one data center goes down, a backup is ready to serve data. we will also use a master-slave (primary-replica) setup with leader election: there is one master, and if the master goes down, we elect a slave to act as master until the original returns. we can also place data centers close to the users' locales to reduce latency. since we will see more reads than writes, we will also set up multiple read replicas to handle read traffic.

scalability

when considering scaling, we can think about sharding, multi data centers, and read replicas.
because we may be writing a lot of data, we will need to shard our MySQL database as it scales up. sharding is a form of horizontal scaling in which the data is partitioned across additional nodes to increase storage and write capacity. sharding brings other factors to consider. we will need a service discovery service (such as ZooKeeper) so that we know which database nodes are available and healthy, a shard proxy that routes each call to the correct shard, and a primary id generator and coordinator so that primary ids do not collide across shards. we can use the Snowflake id algorithm for this.
the shard proxy can also cache query results, monitor db health, and terminate queries
with additional read replicas, writes go to the master, and reads can be served from either the master or the read replicas.
within the shard proxy, we can also add a bloom filter in front of the cache so that keys that are definitely not cached skip the cache lookup.
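
a Snowflake-style id generator can be sketched as follows. the bit layout (41 bits of timestamp, 10 bits of node id, 12 bits of sequence) follows the commonly described Snowflake scheme; the custom epoch and the class name are assumptions for this example.

```python
import threading
import time


class SnowflakeGenerator:
    """64-bit ids: 41 bits of milliseconds | 10 bits of node id | 12 bits of sequence."""

    EPOCH_MS = 1_700_000_000_000  # hypothetical custom epoch
    NODE_BITS = 10
    SEQ_BITS = 12

    def __init__(self, node_id, clock_ms=lambda: int(time.time() * 1000)):
        if not 0 <= node_id < (1 << self.NODE_BITS):
            raise ValueError("node_id out of range")
        self.node_id = node_id
        self.clock_ms = clock_ms
        self._lock = threading.Lock()
        self._last_ms = -1
        self._seq = 0

    def next_id(self):
        with self._lock:
            now = self.clock_ms()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._seq = (self._seq + 1) & ((1 << self.SEQ_BITS) - 1)
                if self._seq == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self.clock_ms()
            else:
                self._seq = 0
            self._last_ms = now
            return (((now - self.EPOCH_MS) << (self.NODE_BITS + self.SEQ_BITS))
                    | (self.node_id << self.SEQ_BITS)
                    | self._seq)


generator = SnowflakeGenerator(node_id=1)
print(generator.next_id())
```

because the node id is baked into every id, two shards with different node ids can never generate the same value, and ids from one node sort by creation time.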

load distribution

we can use consistent hashing with our mysql database to ensure that load is evenly distributed among each node.
consistent hashing is a good option because when a node goes down, we won't have to rehash all the data, which limits the amount of data migration during redistribution. only the subset of the data that lived on that node is affected.
consistent hashing, especially with virtual nodes, also helps us spread data evenly across nodes and reduces hot and cold partitions, although a single very popular key can still create a hot spot.
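
a minimal consistent hash ring with virtual nodes looks roughly like this. the node names, the number of virtual nodes, and the use of SHA-256 as the hash are assumptions for the sketch.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Map keys to nodes on a hash ring; each node owns `vnodes` points on the ring."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def get_node(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First ring point at or after the key's hash, wrapping around at the end.
        idx = bisect.bisect(self._ring, (self._hash(key),))
        return self._ring[idx % len(self._ring)][1]


ring = ConsistentHashRing(["db-1", "db-2", "db-3"])
print(ring.get_node("user-42"))
```

when a node is removed, only the keys that used to map to it are reassigned; every other key keeps its node, which is the property described above.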

replication

because we will have multiple data centers, we should use multi-leader replication to replicate between them. this will allow us to use the backup data center in case the original data center goes down
by waiting for replication to complete (synchronous replication), we can ensure that there is no data loss; however, latency may increase and availability may suffer

overall, this setup helps us avoid a single point of failure which would be a bottleneck.

this database set up (using mysql) was chosen over a nosql database such as cassandra because:

the data is relational and should be stored in a normalized manner; Cassandra prefers query-based data models and is comfortable with denormalized data
Cassandra is eventually consistent by default (its consistency level is tunable per query), but we need strong consistency and transactions within our data
Cassandra is highly scalable: it uses consistent hashing and can easily add nodes to scale horizontally, using a gossip protocol for health checks. this would reduce the complexity of our system, but these features are not enough for us to choose it.

because we will need to shard the database in order to horizontally scale to add more machines to increase capacity, we will need to consider the optimal sharding strategy while using consistent hashing.

since the data we are storing is tied to a specific user (customer or admin), we will want to use user id based sharding.

schema design:

users

movies

movie_and_theatre

primary_key | movie_id | theatre_id

theatre_and_showtime

primary_key | theatre_id | showtime_id

showtimes

primary_key | showtime_id | showtime in utc

ratings

theatres

primary_key | theatre_id | location | seating_capacity

bookings

primary_key | booking_id | user_id | booking_status (paid, refunded) | movie_id, index_by_user_id, index_by_booking_id

payments

primary_key | payment_id | user_id | stripe_payment_token, index_by_user_id

relationships

theatre to movies, many-to-many (through movie_and_theatre)

movie to ratings, one-to-many

user to payments, one-to-many

user to bookings, one-to-many

theatre to showtimes, one-to-many
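
the tables above could be expressed as DDL roughly as follows. this is a sketch under several assumptions: it uses Python's built-in sqlite3 as a stand-in for MySQL, it collapses the separate primary_key and id columns into one, it gives users and movies only an id because their columns are not listed above, and it leaves out ratings for the same reason. column types and the second index name are also assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users    (user_id  INTEGER PRIMARY KEY);
CREATE TABLE movies   (movie_id INTEGER PRIMARY KEY);
CREATE TABLE theatres (
    theatre_id       INTEGER PRIMARY KEY,
    location         TEXT    NOT NULL,
    seating_capacity INTEGER NOT NULL
);
CREATE TABLE showtimes (
    showtime_id  INTEGER PRIMARY KEY,
    showtime_utc TEXT NOT NULL
);
CREATE TABLE movie_and_theatre (
    movie_id   INTEGER NOT NULL REFERENCES movies(movie_id),
    theatre_id INTEGER NOT NULL REFERENCES theatres(theatre_id),
    PRIMARY KEY (movie_id, theatre_id)
);
CREATE TABLE theatre_and_showtime (
    theatre_id  INTEGER NOT NULL REFERENCES theatres(theatre_id),
    showtime_id INTEGER NOT NULL REFERENCES showtimes(showtime_id),
    PRIMARY KEY (theatre_id, showtime_id)
);
CREATE TABLE bookings (
    booking_id     INTEGER PRIMARY KEY,
    user_id        INTEGER NOT NULL REFERENCES users(user_id),
    movie_id       INTEGER NOT NULL REFERENCES movies(movie_id),
    booking_status TEXT    NOT NULL CHECK (booking_status IN ('paid', 'refunded'))
);
CREATE INDEX index_by_user_id ON bookings(user_id);
CREATE TABLE payments (
    payment_id           INTEGER PRIMARY KEY,
    user_id              INTEGER NOT NULL REFERENCES users(user_id),
    stripe_payment_token TEXT    NOT NULL
);
CREATE INDEX index_payments_by_user_id ON payments(user_id);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite only enforces foreign keys when asked
conn.executescript(SCHEMA)
```

the foreign keys are what enforce the one-to-many relationships (a booking or payment must point at an existing user), and the composite primary keys on the join tables prevent duplicate pairings.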

High-level design

You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design...

how to scale

how to achieve high throughput

how to not lose data while processing node crashes

what to do when database is slow or unavailable

scalable = partitioning

reliable = replication and checkpointing

fast = in memory, minimize disk reads

caching strategy

cdn

Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...

user --> browser --> load balancer --> api gateway --> online ticketing platform --> users service queue --> check cache to determine if user token is available --> if not found, go to user service to create an account -->

Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...

load balancer

because we want to achieve high throughput and handle a large volume of requests at scale, we can use HTTP (layer 7) load balancing, as we are dealing with HTTP requests, and use round robin to distribute the requests. we can also run a cluster of load balancers rather than just one to avoid a single point of failure, so that traffic continues to be distributed even if one load balancer goes down
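
round robin with basic health tracking can be sketched in a few lines. the class and backend names are hypothetical, and a real load balancer would learn about backend health from active health checks rather than manual calls.

```python
import itertools


class RoundRobinBalancer:
    """Hand out healthy backends in rotation."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._cycle = itertools.cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def next_backend(self):
        # Look at each backend at most once per call so we never spin forever.
        for _ in range(len(self.backends)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
print([balancer.next_backend() for _ in range(4)])
```

skipping unhealthy backends is what keeps requests flowing when one server goes down; the same idea applies one level up to the cluster of load balancers.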

event-driven architecture + push/pull queue + checkpointing + partitioning

because we want to increase throughput and handle load, we want to use an event-driven architecture built on Kafka to handle data processing.
Kafka is a good choice because consumers checkpoint their progress by committing offsets, and because topics let each service listen only for the events it needs to handle
workflow: request passed through from api gateway --> online ticketing platform --> a cluster of kafka producers, goes to specific topic --> gets read by the kafka consumer listening on that topic --> goes to the service, which also checks zookeeper to determine which database to write to --> checks cache --> writes to database if not in cache --> writes to cache
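
the topic and checkpointing ideas can be illustrated without a real broker. the sketch below is an in-memory simulation, not the Kafka client API: `InMemoryLog`, `run_consumer`, and the topic and group names are all made up for the example.

```python
from collections import defaultdict


class InMemoryLog:
    """Append-only log per topic plus a committed offset per (group, topic)."""

    def __init__(self):
        self.topics = defaultdict(list)
        self.committed = defaultdict(int)  # (group, topic) -> next offset to read

    def publish(self, topic, event):
        self.topics[topic].append(event)

    def poll(self, group, topic):
        """Return (offset, event) pairs that the group has not committed yet."""
        start = self.committed[(group, topic)]
        return list(enumerate(self.topics[topic][start:], start=start))

    def commit(self, group, topic, offset):
        self.committed[(group, topic)] = offset + 1


def run_consumer(log, group, topic, handler):
    for offset, event in log.poll(group, topic):
        handler(event)                    # may raise; the offset is then not committed
        log.commit(group, topic, offset)  # checkpoint only after successful handling


log = InMemoryLog()
log.publish("bookings", {"seat": "A1"})
log.publish("payments", {"amount": 12})
run_consumer(log, "booking-service", "bookings", print)
```

because the offset is committed only after the handler succeeds, a consumer that crashes mid-event picks that event up again on restart. this gives at-least-once delivery, so handlers should be safe to run twice on the same event.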

Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...

Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.

Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?