Design An Online Payment Service - System Design

System requirements

Functional:


Users can send and receive payments electronically
Allows for account management
Allows fund transfers and payment processing
Has fraud detection
Has buyer and seller protection
Supports multi-currency transactions

Non-Functional:


Scalability- application should scale in and out and be deployable globally
Availability- application should be highly available with minimal downtime
Reliability- application should process transactions reliably, avoiding data corruption and making sure transactions are routed properly
Security- financial data should be highly protected, with protection against scammers and illegal transactions
Performance- transactions should complete in under 5 seconds and account edits in under 1 second

Capacity estimation

User and Session Scale

Assume roughly 200 transaction requests per second and 100 account-detail changes per second.

Data Volume

~5 GB of storage for user accounts
~5 GB of transaction history per day, which should be analyzed in real time for fraud detection
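
As a sanity check on these numbers, the arithmetic below works out what they imply per transaction. It is only a back-of-envelope sketch and assumes the 200 transactions per second rate is sustained all day and that GB means decimal gigabytes; neither assumption comes from the estimates themselves.

```python
# Back-of-envelope check of the capacity estimates.
# Assumption: 200 transactions/second is sustained for the full day.
TX_PER_SECOND = 200
SECONDS_PER_DAY = 24 * 60 * 60
DAILY_TX_BYTES = 5 * 10**9  # ~5 GB of transaction history per day (decimal GB)

tx_per_day = TX_PER_SECOND * SECONDS_PER_DAY
bytes_per_tx = DAILY_TX_BYTES / tx_per_day
yearly_tb = DAILY_TX_BYTES * 365 / 10**12

print(f"{tx_per_day:,} transactions per day")
print(f"{bytes_per_tx:.0f} bytes per transaction record")
print(f"{yearly_tb:.1f} TB of transaction history per year")
```

Under those assumptions the system handles about 17.3 million transactions per day, which leaves roughly 290 bytes per stored transaction record, a plausible size for the transaction fields listed in the schema.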

API design


The API calls will be:

completeTransaction(payee_id, payer_id, transaction_time, transaction_amount)- generates a transaction_id, stores the transaction in the database, updates the cache, and sends the transaction to Kafka
createAccount(username, name, email, account_status)- generates a user_id, stores the account in the database, and updates the cache
updateAccount(user_id, field being changed)- could use method overloading (i.e. compile-time polymorphism) to offer one call per field, each taking the user_id and the new value, so that only that field needs to be updated
viewAccountTransactions(user_id, date_range_start, date_range_end)- returns a list of transactions for the user in a given date range
deleteAccount(user_id)- removes the account from the system
viewAccount(user_id)- returns account details and settings from the cache or database

All of these calls use REST: its simple, synchronous request-response model suits these operations, and being stateless makes the services easier to scale.
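
To make the call shapes concrete, here is a minimal in-memory sketch of three of the calls. It is not the real service: the class name `InMemoryPaymentApi`, the snake_case method names, the dictionaries standing in for the database, and storing amounts as integer minor units are all assumptions for illustration, and the cache and Kafka steps are left out.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Transaction:
    transaction_id: str
    payee_id: str
    payer_id: str
    transaction_time: datetime
    transaction_amount: int  # assumption: minor units (e.g. cents), not floats


@dataclass
class InMemoryPaymentApi:
    """Hypothetical stand-in for the services behind the REST endpoints."""
    accounts: dict = field(default_factory=dict)
    transactions: list = field(default_factory=list)

    def create_account(self, username, name, email, account_status="active"):
        user_id = str(uuid.uuid4())
        self.accounts[user_id] = {
            "username": username,
            "name": name,
            "email": email,
            "account_status": account_status,
        }
        return user_id

    def complete_transaction(self, payee_id, payer_id, transaction_time,
                             transaction_amount):
        if payee_id not in self.accounts or payer_id not in self.accounts:
            raise KeyError("unknown payee or payer")
        if transaction_amount <= 0:
            raise ValueError("transaction_amount must be positive")
        tx = Transaction(str(uuid.uuid4()), payee_id, payer_id,
                         transaction_time, transaction_amount)
        self.transactions.append(tx)
        return tx.transaction_id

    def view_account_transactions(self, user_id, date_range_start,
                                  date_range_end):
        # A user sees transactions where they are either payer or payee.
        return [
            tx for tx in self.transactions
            if user_id in (tx.payee_id, tx.payer_id)
            and date_range_start <= tx.transaction_time <= date_range_end
        ]
```

Each method keeps no per-client state between calls, which matches the stateless REST reasoning above.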

Database design


Data model- structured data in relational databases for users and transactions, to maintain the ACID properties that financial data requires, and less structured key-value pairs for fraud analysis and logging data

Schema

User

Title- USER_DATA
primary key- user_id
fields- username, name, email, account status

Transaction

Title- TRANSACTION_DATA
primary key- transaction_id
fields- user_id, transaction_amount, transaction_time, payee_id, payer_id, transaction status

Fraud Analysis

Title- ANALYSIS_DATA
primary key- analysis_id
fields- transaction_id, risk_score, analysis details

Logging

Title- ALERT_LOG
primary key- alert_id
fields- user_id, transaction_id, timestamp, type_id
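
The two relational tables above could be sketched as follows. This uses Python's built-in SQLite purely as a runnable stand-in for a relational database. The column types, the foreign keys, the positive-amount check, and dropping the separate user_id column from the transaction table (payer_id and payee_id already reference users) are assumptions, not part of the schema as written.

```python
import sqlite3

# Assumed column types and constraints; only the column names come from the schema.
SCHEMA = """
CREATE TABLE user_data (
    user_id        TEXT PRIMARY KEY,
    username       TEXT NOT NULL,
    name           TEXT NOT NULL,
    email          TEXT NOT NULL,
    account_status TEXT NOT NULL
);
CREATE TABLE transaction_data (
    transaction_id     TEXT PRIMARY KEY,
    payer_id           TEXT NOT NULL REFERENCES user_data(user_id),
    payee_id           TEXT NOT NULL REFERENCES user_data(user_id),
    transaction_amount INTEGER NOT NULL CHECK (transaction_amount > 0),
    transaction_time   TEXT NOT NULL,
    transaction_status TEXT NOT NULL
);
"""


def create_schema(conn):
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
```

With the foreign keys in place, a transaction that names a payer or payee missing from user_data is rejected by the database itself rather than by application code.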

Partition each table by its most frequently accessed key, which is the primary key. Sharding can distribute the databases by having each instance handle a discrete range of IDs. Amazon Aurora PostgreSQL would be a good choice for the user and transaction databases, and Amazon DocumentDB (with MongoDB compatibility) for the non-relational databases. These will work with a caching layer, such as Amazon ElastiCache or a self-managed Redis deployment, to reduce read operations against the databases.
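
Range-based sharding can be sketched as a lookup from an ID to the instance that owns its range. The boundaries in `SHARD_STARTS` and the function name `shard_for` are made up for illustration, and the sketch assumes numeric IDs.

```python
from bisect import bisect_right

# Hypothetical boundaries: shard i owns ids in [SHARD_STARTS[i], SHARD_STARTS[i + 1]),
# and the last shard owns everything from its start upward.
SHARD_STARTS = [0, 1_000_000, 2_000_000, 3_000_000]


def shard_for(numeric_id: int) -> int:
    """Return the index of the shard whose range contains numeric_id."""
    if numeric_id < SHARD_STARTS[0]:
        raise ValueError("id is below the first shard range")
    # bisect_right counts how many range starts are <= numeric_id.
    return bisect_right(SHARD_STARTS, numeric_id) - 1
```

One trade-off of range sharding is that sequentially assigned IDs all land on the newest shard, so the ID generation scheme has to spread new IDs across ranges.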

High-level design


Key Components/Interactions

Application Load Balancer- distributes requests across instances and works with auto-scaling groups
API Gateway- acts as the entry point for all requests, routing them to the proper services and handling authentication and rate limiting
Payment service- captures the transaction data, pulling user credential and session data from a cache; sends the transaction to the user database, the transaction database, and Kafka; determines whether notifications should reach the user (it already holds the credential and session data, which makes it well placed to decide this); and converts currencies if needed
Kafka- streams data into a data processing service and into a data warehouse for historical analysis, and sends data to the alerts service if the SageMaker model finds fraud
Data Processing Service- enriches data by retrieving additional context from a NoSQL database or a transaction cache for better fraud analysis before passing the data to Kafka
Alerts Service- gets alerts from Kafka, logs them, and passes each alert on to the payment service, which determines whether it should be returned to the user
SageMaker Model- is fed enriched data from the data processing service and returns analysis results to Kafka
Transactional Database- stores transaction records from payment service ensuring data integrity and transactional consistency, used for auditing and historical analysis
User Database- holds details and preferences of user accounts, interacts with the payment service for profile and authentication data
Data Warehouse- data from Kafka is stored here as long-term storage for historical data analytics, often used in training and updating machine learning models
NoSQL Database- provides a flexible storage solution to help with quick lookup operations for data enrichment during processing
Log Database- captures system logs, errors, and alerts from the alerts service for monitoring and troubleshooting
User Cache- helps the payment service capture transaction data faster by storing user credentials and session data
Transaction Cache- caches transactions for quick retrieval by the data processing service
Account Management service- takes in requests from the API Gateway to perform CRUD operations on accounts, updating the cache and database
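
The fraud path through these components (payment service, Kafka, data processing, model, alerts) can be sketched with in-memory stand-ins. Everything here is hypothetical: the deques replace Kafka topics, `USER_CONTEXT` replaces the NoSQL enrichment store, and `risk_score` is a placeholder rule, not a SageMaker model.

```python
from collections import deque

transactions_topic = deque()  # stand-in for the Kafka topic fed by the payment service
alerts_topic = deque()        # stand-in for the topic read by the alerts service

# Stand-in for the NoSQL enrichment store: typical spend per payer, in minor units.
USER_CONTEXT = {"u1": {"avg_amount": 50_00}}


def publish_transaction(tx):
    transactions_topic.append(tx)


def enrich(tx):
    """Data processing step: attach historical context to the raw transaction."""
    context = USER_CONTEXT.get(tx["payer_id"], {"avg_amount": 0})
    return {**tx, "avg_amount": context["avg_amount"]}


def risk_score(enriched):
    """Placeholder rule in place of the model: larger-than-usual amounts score higher."""
    if enriched["avg_amount"] == 0:
        return 0.5  # no history for this payer, so return a neutral score
    ratio = enriched["transaction_amount"] / enriched["avg_amount"]
    return min(1.0, ratio / 10)


def process_pending(threshold=0.8):
    """Drain the transaction topic and publish an alert for each risky transaction."""
    while transactions_topic:
        enriched = enrich(transactions_topic.popleft())
        score = risk_score(enriched)
        if score >= threshold:
            alerts_topic.append({"transaction_id": enriched["transaction_id"],
                                 "risk_score": score})
```

Because the payment service only publishes to the topic, fraud analysis runs asynchronously and does not add latency to the payment request itself.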

Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...

Detailed component design


Analysis Database

Overview- will use Amazon DocumentDB (with MongoDB compatibility). Keeping it inside AWS allows for seamless integration with the other AWS components.
Scalability- the integration with AWS components allows for auto-scaling and global deployment; it can also be combined with edge locations to keep average access time low globally
Data Consistency- can use eventual consistency, since this is not the database used for auditing, and doing so will increase speed and availability
Data Structures- key-value pairs allow for rapid read/write access; in addition, MongoDB-style complex queries and secondary indexes will help speed up data enrichment so that alerts are raised as quickly as possible

Load Balancer

Overview- will use the AWS Application Load Balancer, which manages SSL termination and serves as a proxy to offload the burden of handling HTTPS from back-end services
Algorithms- performs regular health checks on the API Gateway instances and routes around unhealthy ones so they can be replaced; works with auto-scaling groups to scale in and out as needed
Session persistence- can use session stickiness to direct subsequent requests from the same client to the same target, which can also be useful in returning alerts
Security- integrates with security groups or firewalls to control access by things like IP address or protocol
Logging and Monitoring- can integrate logging and monitoring tools like CloudWatch to gain insights into traffic patterns, system health, and enable proactive management
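
The health-check behaviour described above can be illustrated with a small round-robin balancer that skips unhealthy targets. This is a toy model of the idea, not how the AWS load balancer is implemented; the class name and the target names used in the usage note are hypothetical.

```python
from itertools import count


class RoundRobinBalancer:
    """Rotates requests across targets, skipping any marked unhealthy."""

    def __init__(self, targets):
        self.targets = list(targets)
        self.healthy = set(self.targets)
        self._counter = count()

    def record_health(self, target, is_healthy):
        # Called with the result of a periodic health check.
        if is_healthy:
            self.healthy.add(target)
        else:
            self.healthy.discard(target)

    def next_target(self):
        candidates = [t for t in self.targets if t in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy targets available")
        return candidates[next(self._counter) % len(candidates)]
```

For example, with targets gw-1, gw-2, and gw-3, marking gw-2 unhealthy means later requests alternate between gw-1 and gw-3 until a health check reports gw-2 healthy again.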

Transaction DB

Overview- the transaction database uses Amazon Aurora with a PostgreSQL database. The transactional data and the need for strict ACID guarantees, so that the database is auditable, make SQL a good choice. Amazon Aurora allows for seamless integration with the other AWS components and tools. PostgreSQL is used in this case because it allows individual rows to be retrieved, which will make auditing easier.
Scalability- this DB type is optimized for read/write scalability and supports high-throughput transaction processing and real-time financial operations, which is ideal for payment systems.
Data Consistency- the relational SQL nature strongly supports consistency as well as the other ACID principles
Data structures- table data is stored primarily in a heap structure, where each table is a heap and each row is a tuple; indexes use balanced trees (B-trees) by default, which maintain sorted order and allow fast retrieval of rows by key value
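
The value of ACID transactions for payments is easiest to see in a transfer: both the debit and the credit must happen, or neither. The sketch below uses Python's built-in SQLite as a stand-in for a relational database, and the `balances` table and `transfer` function are hypothetical, since the schema above does not define where balances live.

```python
import sqlite3


def transfer(conn, payer_id, payee_id, amount):
    """Move `amount` minor units atomically; any failure rolls back both updates."""
    with conn:  # commits on success, rolls back if an exception escapes
        debit = conn.execute(
            "UPDATE balances SET amount = amount - ? "
            "WHERE user_id = ? AND amount >= ?",
            (amount, payer_id, amount),
        )
        if debit.rowcount != 1:
            raise ValueError("insufficient funds or unknown payer")
        credit = conn.execute(
            "UPDATE balances SET amount = amount + ? WHERE user_id = ?",
            (amount, payee_id),
        )
        if credit.rowcount != 1:
            raise ValueError("unknown payee")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE balances (user_id TEXT PRIMARY KEY, amount INTEGER NOT NULL)")
conn.executemany("INSERT INTO balances VALUES (?, ?)", [("alice", 100_00), ("bob", 0)])
conn.commit()
```

If the credit step fails, the debit that already ran inside the same transaction is rolled back, so money is never removed from one account without arriving in another.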

Trade offs/Tech choices


ID generation- IDs could be generated by hashing or with a random number generator. Hashing would provide consistent mapping and use less compute power, but it would cause more collisions and need additional collision-handling strategies. Random number generation requires more compute power, but modern systems generate random numbers efficiently, collisions become far less likely, and any collision that does happen can be resolved by regenerating the ID. Some additional work is needed to make sure the generated IDs are spread evenly across the database shards, but that only requires generating within alternating windows of ranges.
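
A minimal sketch of that approach: draw a random ID inside one shard's range, retry on collision, and rotate through the shards so IDs are spread evenly. The ranges in `SHARD_RANGES`, the class name, and the in-memory `used` set (standing in for a uniqueness check against the database) are assumptions for illustration.

```python
import secrets

# Hypothetical shard windows as half-open ranges [low, high).
SHARD_RANGES = [(0, 1_000_000), (1_000_000, 2_000_000), (2_000_000, 3_000_000)]


class IdGenerator:
    def __init__(self, shard_ranges, max_attempts=10):
        self.shard_ranges = list(shard_ranges)
        self.max_attempts = max_attempts
        self.used = set()      # stand-in for checking the database for an existing id
        self._next_shard = 0   # alternates the window used for each new id

    def new_id(self):
        low, high = self.shard_ranges[self._next_shard]
        self._next_shard = (self._next_shard + 1) % len(self.shard_ranges)
        for _ in range(self.max_attempts):
            candidate = low + secrets.randbelow(high - low)
            if candidate not in self.used:  # on collision, just draw again
                self.used.add(candidate)
                return candidate
        raise RuntimeError("could not find a free id in this shard window")
```

The retry limit matters once a window starts to fill up, because collisions become frequent and the ranges would need to be widened.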

Load Balancer- Amazon ALB is used for this system because the system benefits from its Layer 7 routing, SSL termination and proxying, and content-based routing to multiple services. In addition, it benefits from integration with the AWS ecosystem.

SQL Databases- an Amazon product was chosen because keeping all components in the AWS ecosystem gives better integration and support. Google Cloud Spanner may provide stronger consistency through its TrueTime API, but the benefits of an integrated ecosystem outweigh that gain, especially because Amazon Aurora also strongly supports the ACID principles needed for these databases.

NoSQL DB- DocumentDB was chosen; again there is the AWS integration as well as the ability to use MongoDB-compatible tooling. Since the data processing service may need to pull data by a number of different fields in complex ways, and needs to do so quickly, the complex query support of MongoDB is ideal for retrieving a mix of data to enrich before sending it to SageMaker. Consistency can also be more relaxed here, so SQL would be unnecessary and the highly tunable consistency of something like Apache Cassandra would be less relevant.

Cache- for the cache I would lean toward Amazon ElastiCache, again for ease of integration with the rest of the AWS ecosystem. The exception is if regulations in certain regions require more granular configuration, in which case something like a self-managed Redis deployment may be better.

API Gateway vs. Direct service invocation

Pros- offers centralized control over incoming requests, easy implementation of cross-cutting concerns (e.g. authentication and rate limiting)
Cons- potential single point of failure if not properly managed
Decision- API Gateway is used to standardize entry points for security and manageability with measures (like load balancing and auto-scaling) to ensure high availability
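
Rate limiting is one of the cross-cutting concerns the gateway centralizes. A common way to implement it is a token bucket, sketched below. This is a generic illustration, not the algorithm any particular gateway product uses; the class name and parameters are hypothetical, and the clock is injectable so the behaviour can be checked without waiting.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate_per_second` tokens."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated_at = clock()

    def allow(self):
        """Return True and spend one token if the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated_at
        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In practice the gateway would keep one bucket per client or API key, so one noisy caller cannot exhaust capacity for everyone else.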

Session management- Stateless vs. Stateful

Stateless
- Pros- easier to scale
- Cons- does not support conversational paradigms
- Decision- stateless services are preferred for ease of scaling, with any state information managed in external stores (e.g. cookies, or databases if needed)
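
One way to keep services stateless while still recognizing a session is to hand the client a signed token and verify it on every request. The sketch below shows the idea with an HMAC signature; it is a simplified illustration, not a full standard such as JWT, it omits expiry, and `SECRET_KEY` and the function names are hypothetical.

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"demo-only-secret"  # assumption: in practice loaded from a secrets manager


def issue_token(claims: dict) -> str:
    """Encode the claims and append an HMAC-SHA256 signature."""
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return f"{payload.decode()}.{signature}"


def verify_token(token: str):
    """Return the claims if the signature is valid, otherwise None."""
    try:
        payload, signature = token.rsplit(".", 1)
    except ValueError:
        return None  # no separator, so this is not one of our tokens
    expected = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))
```

Any service instance holding the key can validate the token, so no instance needs to remember the session and requests can be routed to whichever instance is free.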

Failure scenarios/bottlenecks


Service Overload
- Scenario- sudden spikes in traffic overload the system
- Mitigation- use auto-scaling for API gateway and server instances and ensure load balancers are effectively distributing traffic
Cache Failure
- Scenario- Caching service becomes unavailable leading to increased load on the database
- Mitigation- deploy each caching service as multiple replicated nodes with automatic fail-over, and have services fall back to the database when the cache cannot be reached
Database Failure
- Scenario- a database server experiences downtime or crashes, affecting account lookups and transaction processing
- Mitigation- use distributed databases with replication and fail-over support. Incorporate read replicas to distribute read load and keep hot standbys for quick fail-over
Network Partition
- Scenario- a communication failure between data centers or between services affects data consistency
- Mitigation- design the system using protocols that handle network partitions gracefully, keeping in mind the CAP theorem trade-off between consistency and availability during a partition
Server Crash
- Scenario- an individual service instance crashes due to server failure or application bugs
- Mitigation- deploy multiple instances of each service, making it stateless wherever possible to enable easy instance recovery and load balancing
Data Corruption
- Scenario- data corruption in the database leads to invalid account or transaction records
- Mitigation - implement comprehensive backup and restore strategies, write data integrity checks and use transactions where strong consistency is needed
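
For the cache failure scenario, the read path can be written so that a cache outage degrades performance rather than availability. The sketch below shows a cache-aside read with a database fallback. `FlakyCache`, `CacheUnavailable`, and the plain dictionary standing in for the database are hypothetical, and the function borrows the viewAccount name from the API design.

```python
class CacheUnavailable(Exception):
    """Raised by the cache client when the cache cannot be reached."""


class FlakyCache:
    """Toy cache client whose availability can be switched off."""

    def __init__(self):
        self.data = {}
        self.available = True

    def get(self, key):
        if not self.available:
            raise CacheUnavailable()
        return self.data.get(key)

    def set(self, key, value):
        if not self.available:
            raise CacheUnavailable()
        self.data[key] = value


def view_account(user_id, cache, database):
    """Cache-aside read that falls back to the database if the cache is down."""
    try:
        cached = cache.get(user_id)
        if cached is not None:
            return cached
    except CacheUnavailable:
        return database[user_id]  # degrade: serve straight from the database
    account = database[user_id]   # cache miss: read through
    try:
        cache.set(user_id, account)
    except CacheUnavailable:
        pass  # failing to populate the cache must not fail the request
    return account
```

The cost of this fallback is the extra database load named in the scenario, so it works best alongside read replicas and rate limiting.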

Bottlenecks

Database write Latency
- Bottleneck- high write demand during transaction processing can slow down the service
- Solution- use a message queue for write requests to buffer and gradually process incoming requests allowing the system to handle bursts of write operations
Cache Capacity
- Bottleneck- limited cache size can result in frequent cache misses increasing database load
- Solution- optimize cache eviction policies and expand cache capacity horizontally by distributing over several nodes
Network Latency
- Bottleneck- high network latency between services especially if they are geographically dispersed
- Solution- Use CDNs for regional caching and peer replication across multiple data centers to reduce average latency
API Gateway Load
- Bottleneck- the API Gateway becomes a bottleneck if it is not sufficiently scalable
- Solution- implement a highly available API Gateway setup with load balancing and fail-over to distribute incoming traffic effectively
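
The message-queue solution to database write latency can be sketched as a bounded buffer that accepts writes quickly and flushes them to the database in batches. This is only an in-process illustration with hypothetical names; for payment data the real queue would have to be durable (for example, the Kafka cluster already in the design) so that buffered writes survive a crash.

```python
import queue


class BufferedWriter:
    """Buffers write requests and hands them to the database in batches."""

    def __init__(self, write_batch, batch_size=100, max_pending=10_000):
        self.write_batch = write_batch  # callable that persists a list of records
        self.batch_size = batch_size
        self.pending = queue.Queue(maxsize=max_pending)

    def submit(self, record):
        """Accept a write, or return False to signal back-pressure when full."""
        try:
            self.pending.put_nowait(record)
            return True
        except queue.Full:
            return False

    def flush_once(self):
        """Write up to batch_size pending records; return how many were written."""
        batch = []
        while len(batch) < self.batch_size:
            try:
                batch.append(self.pending.get_nowait())
            except queue.Empty:
                break
        if batch:
            self.write_batch(batch)
        return len(batch)
```

The bounded queue means a burst is absorbed up to `max_pending`, after which callers are told to slow down instead of the database being overwhelmed.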

Future improvements


Implement CloudFront for lower global latency and better global availability

Machine Learning for Traffic Prediction

Improvement- use machine learning to predict traffic bursts and intelligently manage resources
Benefit- efficient allocation of system resources based on predictive models can reduce costs and improve service availability under unexpected loads

Increased Interoperability

Improvement- integrate with more third-party services (e.g. social media platforms, content management systems) for seamless online payment
Benefit- enhances convenience and adoption as users can pay directly within other platforms they are using

Mobile Application Integration

Improvement- develop mobile apps with integrated payment and account management features
Benefit- provides convenience for users on the go, allowing them to send payments and review their transactions more effectively