Codemia | Master System Design Interviews Through Active Practice

My Solution for Design an Online Chess Service with Score: 7/10

by utopia4715

System requirements

Functional:

I will skip user sign up.

Pair players: User signs up for a time slot and asks to play at a certain level. System finds a matching player and sends an invite links to both players.

Move updates: User selects a move on a GUI. System validates the move, updates the other player's board, and declares a winner if one player won.

For now, I won't consider storing a history of a game.

Non-Functional:

Response time is important. After a player makes a move, it should take less than 2 seconds for the other player to see the update.

The system should be able to function even if network goes offline for a short amount of time.

Multiple games can be played. Let's say 1 million games at a peak time.

Consistency. It is vital that both players see the same board all the time.

Capacity estimation

Each game can be represented by:

8x8 matrix to present the board
Metadata like start time, end time, level
Players playing

1KB probably would be sufficient to store these.

1M games * 1KB = 1GB of data.

We need to store data for pairing as well:

Time slot
User IDs
Level

These are similarly small.

Because the data size is small and consistency is important, I think relational DB makes sense as the primary data store.

API design

request_game(user_ID, date_time_slot, level)

By using this API, a user indicates that they want to play the game, at this time slot, at that level.

It does not immediately return game information because there may not be a good match at the time of request.

When the system finds a match, the system schedules a game, and sends a URL link to the two matched players in email. It contains game ID and which game server the users will be connected to.

join_game(user_ID, game_URI)

game_URI has encoded information for the game ID and the game server they should connect to.

With in WebSocket connection,

When one player makes a move, client sends the following to the server:

move(user_ID, piece, start_location, destination_location)

The server, after validating the move, sends the board state to the client:

board_state(8 x 8 board representation)

Database design

I will write about critical tables:

Game:

game_id
board_state # JSON object representing the board
start_time
end_time
winner
number_of_moves

User:

user_id
name
email

Game_User: # this is the join tables which tells us which players are playing which game.

game_id
user_id

Request:

user_id
date_time_slot
level
status # already matched or not

A couple of decision point:

I would represent the board state (8 x 8 matrix of which pieces are where) in JSON format instead of implementing it as a table (e.g. 64 rows representing 64 spaces) because the board state is not used for indexing. This would simplify the data model

We could store two user IDs in Game table. This would let us avoid join table. However, doing so would make a common query "list all the games this user played" slow. So I opted for a join table.

High-level design

There will be two microservices:

Management Service, which receives requests for a game, creates a game, matches players, and sends notification.
Game Service, which connects with clients via WebSocket and executes the game.

Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...

Detailed component design

Player Matching

One of the key components of this system is how to match players.

When Player 1 requests a game, Management Service would put the request, including the game level and time slot, in a message queue.

When Player 2 requests a game, Management Service first checks if there is a player in the message queue who would match Player's request in terms of level and time.

If there is, these two players are matched, and Player 1 is removed from the queue.

If there is not, Player 2 is added to the queue.

The queue should be partitioned by levels and time slots. For example, date and time slot can be modeled as a topic. This helps scalability by keeping the queue size smaller, making the query less expensive, and partitioning of the message queue server possible.

This would address some matching conflict situations:

Player never getting a match: Because this is queue based, the earliest requester gets a chance for a match first. This ensures fairness and avoids a player never getting a match.
Synchronization issues, e.g., two players requesting at the same time. The message queue acts as a sequencing mechanism, avoiding synchronization issues such as duplicate matches or two players not getting matched.

If Player 1 wants to play at a certain level (say 5), but if no matching Player showed up in time (let's say until 5 minutes before the game), we would expand the level criteria (e.g. maybe level 4 - 6 would be acceptable). I would imagine most players want to play, even against a player with slightly different level.

Move validation

When a move is made by Player 1, the client calls move(piece, start_location, destination_location) message via WebSocket.

Game Service makes sure this move is legal (each piece has a set of moves that are allowed). The system takes the type of the piece, starting location, and the destination as an input (e.g. a pawn on location A moves forward by 1 spot, or 2 spots, or makes a diagonal move to take the opponent's piece), and determines this legality. A care has to be taken for special rules such as castling.

If the move is illegal, return an error. If the move is legal, the system updates the board state, and send it to the clients.

Trade offs/Tech choices

Client-server communication during a game can be done by web socket, long polling, or polling. Polling would require too many requests to server and result in long response time. Both web socket or long polling would work, as they both give short response time and conserve resources on the servers. Long Polling requires more requests than web socket, because it needs to reconnect once in a while. Web Socket consumes more memory than long polling. I would first try web socket, measure memory and performance, and confirm it is a good approach, or we should try long polling.

Redis cache is employed to store oft-requested data (e.g. rows in Game table) can be accessed quickly by the microservices. Also, it would be beneficial to back up websocket connection for fault tolerance.

Game server should be selected to minimize the user response time. For example, if two players are in the same region, a game server in a data center that is close to the players should be chosen. If two players are in different regions, we can estimate the response time from each data center, and choose a data center which would provide reasonable response time to both players.

Failure scenarios/bottlenecks

When a client of a player (let's say Player 1) becomes offline from the server for a short amount of time, we'd like to make sure the players can keep playing.

Player 1 can still view the board and make a move. To do so, the client has to store the board state (8 x 8 board) and store the move the player makes.

Once the network (therefore the WebSocket connection) is re-established, the last move is propagated to Player 2's client.

During Player 1 is disconnected, the GUI for both players should display a message "Player 1 is disconnected. We will resume the game when they come back online."

Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?