System requirements


Functional:

  • Support account management such as linking bank accounts, view purchase history
  • Support users send and receive payments to/from other users.
  • Support users sending payments to merchants.
  • Support multiple currencies
  • Support fraud detection
  • Support notifications via email, SMS, etc.


Non-Functional:

  • Strong consistency in account management and fund transfers.
  • Highly available (99.99%)
  • Latency is not critical, although should be as soon as possible
  • Scales to 1B users, 100M DAU, 10M daily transactions.
  • High secure and compliant. Suport PCI DSS etc.



Capacity estimation

Storage

  • Assume each transaction takes 1KB including history, 10M * 1KB = 10GB / day, or 3.6TB / year, and 18TB for 5 years.
  • 1B user profile. Assume 1k each = 1TB
  • Assume 10M merchants, 1k for each profile = 10GB.


Requests

  • Average tps is 10M / 86400 = 100 tps
  • Assume peak qps is 10 times that = 1000 tps



API design

// Account management

POST /users -> User

GET /users/ -> User

PATCH /users/


POST /merchants -> Merchant

GET /merchants/ -> Merchant

PATCH /merchants/


// Payment

POST /payments -> Payment // could be send or request

GET /payments?q={filters}&pageSize={}&pageToken={}

GET /payments/


// Fund transfer

POST /transfers -> Transfer



Database design

User

  • id (PK)
  • name
  • login
  • ...
  • paymentPreference


UserBalance

  • userId (PK)
  • balance
  • currency
  • lastUpdated


Merchant

  • id (PK)
  • name
  • login
  • ...
  • accountInfo


UserPaymentAccounts

  • userId (FK -> User)
  • accountType // bank | creditCard
  • accountInfo


Payment

  • id (PK)
  • type // payment | request
  • amount
  • currency
  • merchantId (FK -> Merchant)
  • userId (FK -> user)
  • status
  • ...


Transfer // balance -> balance transfer)

  • id (PK)
  • amount
  • currency
  • merchantId (FK -> Merchant)
  • userId (FK -> user)
  • status
  • ...


Transaction

  • id (PK)
  • paymentId (FK -> Payment)
  • paymentType // balance | paymentAccountId
  • createTime
  • updateTime
  • status


TransactionLog

  • transactionId (FK -> Transaction)
  • timestamp


High-level design

User and Merchant have separate clients


PaymentService handles user payment to merchants


TransferService handles user to user transfers


Payments are queued, and processed by PaymentProcessor. It adds transactions to a transaction queue, which is processed by TransactionProcessor.


TransactionProcessor calls PaymentNetwork asynchronously. It registers a callback so PaymentNetwork can report status.


All services reports to DB to record payment, transfer, transaction and their logs.


Payment and Transfer services pushes messages to NotificationQueue, which is processed by NotificationService and sends to client.





Request flows

Account management flow is standard CRUD operation against DB.


Payment request:

  • Merchant sends a payment request to Payment service, which get registered in the DB.
  • Payment service pushes a notification to notification queue. User gets a notification to pay, and opens the client app to pay the payment.
  • User sends a request to PaymentService to pay.
  • A message gets added to PaymentQueue via CDC.
  • PaymentProcessor looks at the Transaction table and notice there are no transactions in progress for the payment. It creates a transaction in the DB.
  • A message is added to TransactionQueue via CDC.
  • TransactionProcessor looks at the status of the transaction (which is unprocessed), then calls PaymentNetwork asynchronsly, and updates the Transaction and TransactionLog table.
  • PaymentNetwork periodically calls back and pushes message to PaymentNetworkQueue.
  • TransactionProcessor gets the message, records it in the log, and checks if transaction is complete.
  • If complete, it sends a message to PaymentProcessor via PaymentQueue. PaymentQueue updates the Payment, and sends a notification.
  • If the transaction fails, it still sends the message to PaymentQueue. PaymentProcessor determines if retry is needed, and the retry strategy.


Transfer

  • User initiates a transfer by calling TransferService. TransferService records the transfer in unprocessed state, then enqueues a message in TransferQueue.
  • TransferProcessor processes the transfer, records the result in DB, retry if needed, and sends a notification.


Detailed component design

TransferProcessor

  • It deducts the transfer amount from user1's balance, and adds the same amount to user2's balance.
  • This an be done in one transaction, subject to validation.


PaymentProcessor

  • To process a payment, multiple transaction might be needed (e.g. a transaction failed and need to be retried).
  • It is responsible to create new transactions, and determine retry strategy (e.g. a limit with exponential backoff)


TransactionProcessor

  • PaymentNetworks are inherently async, and messages could arrive out of order or lost)
  • TransactionProcessor needs to reconcile the message coming from PaymentNetwork.
  • It should also periodically (via event driven message) to make sure the transaction is alive. Otherwise it should report the transaction to be failed. The idempotency of the transaction is guaranteed by the transaction id.
  • When transaction status is undetermined (e.g. message lost in Payment Network), the PaymentProcessor should be automatically retry, but send the transaction to a separate queue for investigation.


Trade offs/Tech choices

We uses queues to decouple the services due to the asynchronous nature of payment processing, and to improve reliability of the system.


Fund transfers can be processed in one database transactions, therefore we simply do this in one service.


We store all transaction history in DB. This is needed in order to reconcile asynchronous transactions, and for auditing purposes.


For the DB, we choose a ACID relational database for its strong consistency. We need to shard it (see next section).


Failure scenarios/bottlenecks

All services are stateless and can be scaled horizontally.


The database cannot be naively shared since payments and transfers may be may transacted between any 2 users and merchants.. Storage size is not a problem. Assume each payment generate 10 database transactions, 10k tps is on the edge of a well-tuned database.

  • One idea is to have a separate DB for balances, and synchronize user accounts across the 2 database. Since user account changes are less frequent, this may be a good option.
  • If we do this, then we can shard the payment portion of the database based on payment id.


Kafka queues: assume each transaction needs 10 messages, our 1k peak qps or 10k message/sec can be easily handled by 1 partition. We can create 3 partitions so we have redundant capacity.


Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?