Design An Online Payment Service - System Design

System requirements

Functional:

1. A user should be able to create an account profile.

2. A user should be able to send and receive payments electronically.

3. A user should be able to perform fund transfers.

Non-Functional:

1. Performance - We want low latency for a good user experience.

2. Scalability - We want a system that scales well and can grow as more users are added.

3. Reliability - The system should be reliable and durable for clients.

Capacity estimation

DAU: 50M

Payment size: 100 bytes for a unique id plus 400 bytes for other content (payment to, payment from, currency, etc.), so 500 bytes per payment.

25% of daily users send a payment each day.

Each user sends up to 10 payments a day (used as the upper bound for storage).

QPS: 50 * 10^6 DAU * .25 payments per user * 2 (scaling factor) / 10^5 seconds in a day

= 5 * 10^2 * .5 = 250 QPS

Total data would be 50 * 10^6 DAU * 10 payments * 400 days in a year * 5 years * 3 copies * 500 bytes

= (5 * 10^7) * 10 * (4 * 10^2) * 5 * 3 * (5 * 10^2)

= (5 * 4 * 5 * 3 * 5) * 10^12 = 1.5 * 10^15 bytes = 1.5 PB

250 QPS

1.5 PB

TPS: 50 * 10^6 * .5 / 10^5 = 250
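
The estimate can be re-checked with a few lines of Python. This is only a sketch: the constant names are made up for illustration, and the rounded values (10^5 seconds per day, 400 days per year) are the same simplifications used in the formulas above.

```python
# Back-of-the-envelope capacity check using the inputs listed above.
DAU = 50_000_000
SENDER_FRACTION = 0.25         # share of daily users who send a payment
SCALING_FACTOR = 2             # headroom multiplier from the QPS formula
SECONDS_PER_DAY = 10**5        # rounded up from 86,400

qps = DAU * SENDER_FRACTION * SCALING_FACTOR / SECONDS_PER_DAY

MAX_PAYMENTS_PER_USER = 10     # upper bound, used for storage sizing
DAYS_PER_YEAR = 400            # rounded up from 365
YEARS = 5
COPIES = 3                     # replication factor
PAYMENT_BYTES = 500

storage_bytes = (DAU * MAX_PAYMENTS_PER_USER * DAYS_PER_YEAR
                 * YEARS * COPIES * PAYMENT_BYTES)
storage_pb = storage_bytes / 10**15

print(f"{qps:.0f} QPS, {storage_pb:.1f} PB")
```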

API design

https://createAccount (POST)

(user_id, user_name, user_password, age, location, profile_info, timestamp)

https://startPayment (GET)

(payment_to, user_id, payment_from, currency, timestamp)

https://postPayment (POST)

(payment_to, user_id, payment_from, currency, timestamp, payment_token)

https://transferFunds

(transfer_to, transfer_from, user_id, currency, timestamp)
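
As an illustration of the parameter lists above, a postPayment request body might be built as follows. The field names come from the list for https://postPayment; the class name, the sample values, and the choice of JSON as the wire format are assumptions made for this sketch.

```python
import json
import time
import uuid
from dataclasses import dataclass, asdict


@dataclass
class PostPaymentRequest:
    """Fields taken from the postPayment parameter list (class name is hypothetical)."""
    payment_to: str
    user_id: str
    payment_from: str
    currency: str
    timestamp: float
    payment_token: str


request = PostPaymentRequest(
    payment_to="user-b",                 # sample value
    user_id="user-a",                    # sample value
    payment_from="user-a",               # sample value
    currency="USD",
    timestamp=time.time(),
    payment_token=str(uuid.uuid4()),     # in the design this comes from startPayment
)

# Serialize to JSON (assumed wire format) for the POST body.
body = json.dumps(asdict(request))
```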

Database design

user_table

id: string PK

pw_hash: string

name: string

age: string

location: string

profile_info: string

timestamp: string

index: user_id

payment_table

id: string PK

from_id: string FK

to_id: string FK

token: string

account_table

id: string PK

balance: string

index: account_id

transaction_table

id: string PK

content: string
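
The tables above can be sketched as a schema. The example below uses Python's built-in sqlite3 module purely for illustration, since the design does not name a database engine. Column names and string types follow the lists above. Two details are assumptions: the foreign keys are pointed at user_table, and the token column is marked UNIQUE to support the idempotency check.

```python
import sqlite3

SCHEMA = """
CREATE TABLE user_table (
    id           TEXT PRIMARY KEY,
    pw_hash      TEXT,
    name         TEXT,
    age          TEXT,
    location     TEXT,
    profile_info TEXT,
    timestamp    TEXT
);
CREATE TABLE payment_table (
    id      TEXT PRIMARY KEY,
    from_id TEXT REFERENCES user_table(id),  -- assumed FK target
    to_id   TEXT REFERENCES user_table(id),  -- assumed FK target
    token   TEXT UNIQUE                      -- assumed: one payment per token
);
CREATE TABLE account_table (
    id      TEXT PRIMARY KEY,
    balance TEXT
);
CREATE TABLE transaction_table (
    id      TEXT PRIMARY KEY,
    content TEXT
);
"""

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript(SCHEMA)

tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
```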

High-level design

Request flows

1. The user calls https://createAccount to create an account. The request goes to the load balancer, then to the API gateway, and then to the Profile service. The Profile service creates a password hash and hands the write to the coordinator, which routes it to the write database.

2. The user starts a payment via the https://startPayment (GET) API. The request reaches the Payment service, which asks the Fraud detection service to check the request for issues and then returns a payment token that the user needs to continue creating the payment.

3. The user posts a payment via the https://postPayment (POST) API. The request follows the same path as start payment and reaches the Payment service. The payment is written to the database only if the payment token has not been used already (idempotency).

4. The user initiates a transfer to another account via the https://transferFunds API, which is routed to the Transfer service. The Transfer service splits the transaction into two parts, sending and receiving. Both parts are placed on the queue and then recorded in the transaction phase DB, which tracks each phase of the transaction, such as the deduction from one account and the addition to the other.

Detailed component design

Payment Service:

1. The Payment service creates a payment_token for each request so that payments are idempotent: we do not want to process the same payment twice. When the user posts the payment, the database saves the payment_token. If a later request arrives with the same token, the service does not write another payment to the database and instead returns the payment that was already processed.
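
The token check described above can be sketched in a few lines. This is a minimal in-memory illustration, not the service itself: the class and method names are hypothetical, and a dictionary stands in for the database table that stores each payment_token.

```python
import uuid


class PaymentService:
    """Toy in-memory stand-in for the Payment service (names are hypothetical)."""

    def __init__(self):
        self.issued_tokens = set()     # tokens handed out by start_payment
        self.payments_by_token = {}    # stands in for the payment table

    def start_payment(self):
        # Issue a fresh token that the client must send back with the payment.
        token = str(uuid.uuid4())
        self.issued_tokens.add(token)
        return token

    def post_payment(self, token, payment_from, payment_to, currency):
        if token not in self.issued_tokens:
            raise ValueError("unknown payment token")
        existing = self.payments_by_token.get(token)
        if existing is not None:
            # Token already used: return the stored payment, write nothing new.
            return existing
        payment = {
            "id": str(uuid.uuid4()),
            "from_id": payment_from,
            "to_id": payment_to,
            "currency": currency,
            "token": token,
        }
        self.payments_by_token[token] = payment
        return payment


service = PaymentService()
token = service.start_payment()
first = service.post_payment(token, "user-a", "user-b", "USD")
retry = service.post_payment(token, "user-a", "user-b", "USD")  # e.g. a client retry
```

In a real database the same guarantee would usually come from a uniqueness constraint on the token column, so that two concurrent requests cannot both pass the check.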

Transfer Service:

1. The Transfer service receives a request to transfer funds from one account to another. It creates two actions: one to deduct from account A and one to add to account B.

2. Both actions are inserted into the queue. The coordinator uses a pull model: it tells the queue when it is ready for the next request.

3. The request is sent to the coordinator, which routes it to the transaction phase DB. The transaction phase DB logs the action that is about to be taken.

4. An exclusive database record lock is held on account A while we update the account balance.

5. The database record lock is removed.

6. The Transaction phase DB logs the successful account deduction from account A.

7. The next action from the queue for adding to account B is performed.

8. The Transfer service uses the Saga pattern rather than 2 phase commit because it performs and scales better. A saga treats each of our actions as an independent local transaction and therefore holds a database record lock only for that specific action. If a later step fails, the saga undoes the steps that already completed by running compensating actions (for example, crediting account A back). The result is eventual consistency, because there is a brief point in time where the user could see inconsistent states.

1. We use pessimistic database record locks rather than optimistic record locks because we want to avoid write conflicts. This is a financial application, so we want the system to be reliable, and avoiding write conflicts helps us achieve that. Dealing with write conflicts adds complexity to the system, because we would have to implement a way to resolve them, such as version vectors.

2. We use Saga instead of 2 phase commit because 2 phase commit has serious issues at scale. A transfer needs two actions to complete before it is deemed successful. With 2 phase commit we would have to acquire locks on both accounts, wait for all the nodes to agree to commit, and then wait for the nodes to actually commit the transfers. During this time we could hit network connectivity issues, which makes it more likely that database record locks get stuck in a waiting phase. It is also slower, and we have made a commitment to Performance in this design. Saga, on the other hand, fits a microservice architecture well and has a rollback feature to help mitigate any issues or failures that might occur.
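
The transfer steps above can be sketched as a two-step saga with a compensating action. This is a simplified, single-process illustration under several assumptions: a threading.Lock per account stands in for the exclusive database record lock, a Python list stands in for the transaction phase DB, amounts are integer cents, and the fail_credit flag exists only to simulate a failure. All names are hypothetical.

```python
import threading


class Account:
    def __init__(self, balance):
        self.balance = balance           # integer cents
        self.lock = threading.Lock()     # stands in for the record lock


def debit(account, amount):
    with account.lock:                   # lock held only for this one action
        if account.balance < amount:
            raise ValueError("insufficient funds")
        account.balance -= amount


def credit(account, amount, fail=False):
    with account.lock:
        if fail:
            raise RuntimeError("simulated failure")
        account.balance += amount


def transfer(account_a, account_b, amount, phase_log, fail_credit=False):
    """Run debit-then-credit as a saga; returns True on success."""
    phase_log.append("debit A: started")
    try:
        debit(account_a, amount)
    except ValueError:
        phase_log.append("debit A: failed")
        return False
    phase_log.append("debit A: done")

    phase_log.append("credit B: started")
    try:
        credit(account_b, amount, fail=fail_credit)
    except RuntimeError:
        phase_log.append("credit B: failed")
        credit(account_a, amount)        # compensating action: refund A
        phase_log.append("debit A: compensated")
        return False
    phase_log.append("credit B: done")
    return True


a, b = Account(100), Account(0)
ok_log, failed_log = [], []
ok = transfer(a, b, 30, ok_log)
failed = transfer(a, b, 30, failed_log, fail_credit=True)
```

Between "debit A: done" and "credit B: done" the money has left account A but has not yet arrived in account B, which is the brief inconsistent window that eventual consistency allows.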

Failure scenarios/bottlenecks

1. We use single-leader replication, which is likely to cause problems once we scale to multiple regions across the globe. Routing every write request from around the world to one leader database is slow and does not scale.

Future improvements

1. Use multi-leader replication, with one leader acting as the primary write database in every region. This introduces potential write conflicts, which we try to limit by keeping exactly one leader node per region. If write conflicts do arise, we will implement version vectors on all writes. A version vector stores, for each record, one version counter per database node. Comparing two vectors tells us whether one write happened after the other or whether the two writes were concurrent and therefore conflict.
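
A version vector comparison can be sketched as follows. This is a minimal illustration rather than a full conflict-resolution scheme: vectors are plain dictionaries mapping a node name to a counter, and the function name, node names, and return labels are all hypothetical.

```python
def compare_version_vectors(a, b):
    """Compare two version vectors given as {node_name: counter} dicts."""
    nodes = set(a) | set(b)
    # A missing entry means that node has not written the record yet (counter 0).
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "conflict"    # concurrent writes: neither version saw the other
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"


# The same record as seen by the leaders in two regions.
us_copy = {"us-leader": 2, "eu-leader": 1}
eu_copy = {"us-leader": 1, "eu-leader": 2}
result = compare_version_vectors(us_copy, eu_copy)
```

When the result is "conflict", the system still needs a policy for merging or choosing between the two versions. For account balances that policy has to be chosen carefully, since silently discarding one write would lose money.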