System requirements
Functional:
- Support account management such as linking bank accounts, view purchase history
- Support users send and receive payments to/from other users.
- Support users sending payments to merchants.
- Support multiple currencies
- Support fraud detection
- Support notifications via email, SMS, etc.
Non-Functional:
- Strong consistency in account management and fund transfers.
- Highly available (99.99%)
- Latency is not critical, although should be as soon as possible
- Scales to 1B users, 100M DAU, 10M daily transactions.
- High secure and compliant. Suport PCI DSS etc.
Capacity estimation
Storage
- Assume each transaction takes 1KB including history, 10M * 1KB = 10GB / day, or 3.6TB / year, and 18TB for 5 years.
- 1B user profile. Assume 1k each = 1TB
- Assume 10M merchants, 1k for each profile = 10GB.
Requests
- Average tps is 10M / 86400 = 100 tps
- Assume peak qps is 10 times that = 1000 tps
API design
// Account management
POST /users -> User
GET /users/
PATCH /users/
POST /merchants -> Merchant
GET /merchants/
PATCH /merchants/
// Payment
POST /payments -> Payment // could be send or request
GET /payments?q={filters}&pageSize={}&pageToken={}
GET /payments/
// Fund transfer
POST /transfers -> Transfer
Database design
User
- id (PK)
- name
- login
- ...
- paymentPreference
UserBalance
- userId (PK)
- balance
- currency
- lastUpdated
Merchant
- id (PK)
- name
- login
- ...
- accountInfo
UserPaymentAccounts
- userId (FK -> User)
- accountType // bank | creditCard
- accountInfo
Payment
- id (PK)
- type // payment | request
- amount
- currency
- merchantId (FK -> Merchant)
- userId (FK -> user)
- status
- ...
Transfer // balance -> balance transfer)
- id (PK)
- amount
- currency
- merchantId (FK -> Merchant)
- userId (FK -> user)
- status
- ...
Transaction
- id (PK)
- paymentId (FK -> Payment)
- paymentType // balance | paymentAccountId
- createTime
- updateTime
- status
TransactionLog
- transactionId (FK -> Transaction)
- timestamp
High-level design
User and Merchant have separate clients
PaymentService handles user payment to merchants
TransferService handles user to user transfers
Payments are queued, and processed by PaymentProcessor. It adds transactions to a transaction queue, which is processed by TransactionProcessor.
TransactionProcessor calls PaymentNetwork asynchronously. It registers a callback so PaymentNetwork can report status.
All services reports to DB to record payment, transfer, transaction and their logs.
Payment and Transfer services pushes messages to NotificationQueue, which is processed by NotificationService and sends to client.
Request flows
Account management flow is standard CRUD operation against DB.
Payment request:
- Merchant sends a payment request to Payment service, which get registered in the DB.
- Payment service pushes a notification to notification queue. User gets a notification to pay, and opens the client app to pay the payment.
- User sends a request to PaymentService to pay.
- A message gets added to PaymentQueue via CDC.
- PaymentProcessor looks at the Transaction table and notice there are no transactions in progress for the payment. It creates a transaction in the DB.
- A message is added to TransactionQueue via CDC.
- TransactionProcessor looks at the status of the transaction (which is unprocessed), then calls PaymentNetwork asynchronsly, and updates the Transaction and TransactionLog table.
- PaymentNetwork periodically calls back and pushes message to PaymentNetworkQueue.
- TransactionProcessor gets the message, records it in the log, and checks if transaction is complete.
- If complete, it sends a message to PaymentProcessor via PaymentQueue. PaymentQueue updates the Payment, and sends a notification.
- If the transaction fails, it still sends the message to PaymentQueue. PaymentProcessor determines if retry is needed, and the retry strategy.
Transfer
- User initiates a transfer by calling TransferService. TransferService records the transfer in unprocessed state, then enqueues a message in TransferQueue.
- TransferProcessor processes the transfer, records the result in DB, retry if needed, and sends a notification.
Detailed component design
TransferProcessor
- It deducts the transfer amount from user1's balance, and adds the same amount to user2's balance.
- This an be done in one transaction, subject to validation.
PaymentProcessor
- To process a payment, multiple transaction might be needed (e.g. a transaction failed and need to be retried).
- It is responsible to create new transactions, and determine retry strategy (e.g. a limit with exponential backoff)
TransactionProcessor
- PaymentNetworks are inherently async, and messages could arrive out of order or lost)
- TransactionProcessor needs to reconcile the message coming from PaymentNetwork.
- It should also periodically (via event driven message) to make sure the transaction is alive. Otherwise it should report the transaction to be failed. The idempotency of the transaction is guaranteed by the transaction id.
- When transaction status is undetermined (e.g. message lost in Payment Network), the PaymentProcessor should be automatically retry, but send the transaction to a separate queue for investigation.
Trade offs/Tech choices
We uses queues to decouple the services due to the asynchronous nature of payment processing, and to improve reliability of the system.
Fund transfers can be processed in one database transactions, therefore we simply do this in one service.
We store all transaction history in DB. This is needed in order to reconcile asynchronous transactions, and for auditing purposes.
For the DB, we choose a ACID relational database for its strong consistency. We need to shard it (see next section).
Failure scenarios/bottlenecks
All services are stateless and can be scaled horizontally.
The database cannot be naively shared since payments and transfers may be may transacted between any 2 users and merchants.. Storage size is not a problem. Assume each payment generate 10 database transactions, 10k tps is on the edge of a well-tuned database.
- One idea is to have a separate DB for balances, and synchronize user accounts across the 2 database. Since user account changes are less frequent, this may be a good option.
- If we do this, then we can shard the payment portion of the database based on payment id.
Kafka queues: assume each transaction needs 10 messages, our 1k peak qps or 10k message/sec can be easily handled by 1 partition. We can create 3 partitions so we have redundant capacity.
Future improvements
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?