System requirements


Functional:

  1. Users can send and receive payments electronically
  2. Users can manage their accounts
  3. Supports fund transfers and payment processing
  4. Detects fraud
  5. Provides buyer and seller protection
  6. Supports multi-currency transactions


Non-Functional:

  1. Scalability- the application should scale in and out and be deployable globally
  2. Availability- the application should be highly available with minimal downtime
  3. Reliability- transactions should be processed reliably, avoiding data corruption and making sure each transaction is routed properly
  4. Security- financial data should be strongly protected, with safeguards against scammers and illegal transactions
  5. Performance- transactions should complete in under 5 seconds and account edits in under 1 second


Capacity estimation


User and Session Scale

  • Assume about 200 transaction requests per second and about 100 account-detail changes per second

Data Volume

  • ~5 GB of storage for user accounts
  • ~5 GB of transaction history per day, which should be analyzed in real time for fraud detection
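
As a rough sanity check, the figures above can be turned into daily counts and a per-transaction size budget. This is only a back-of-the-envelope sketch: it assumes the request rates are sustained for the whole day and that 1 GB is 10**9 bytes, neither of which is stated in the estimates themselves.

```python
# Back-of-the-envelope check of the capacity figures above.
# Assumptions (not from the estimates): rates are sustained all day, 1 GB = 10**9 bytes.
TRANSACTIONS_PER_SECOND = 200
ACCOUNT_CHANGES_PER_SECOND = 100
DAILY_TRANSACTION_BYTES = 5 * 10**9
SECONDS_PER_DAY = 24 * 60 * 60

transactions_per_day = TRANSACTIONS_PER_SECOND * SECONDS_PER_DAY
account_changes_per_day = ACCOUNT_CHANGES_PER_SECOND * SECONDS_PER_DAY

# Storage budget per transaction record implied by ~5 GB/day.
bytes_per_transaction = DAILY_TRANSACTION_BYTES / transactions_per_day

# Transaction history accumulated over one year, in terabytes.
yearly_transaction_terabytes = DAILY_TRANSACTION_BYTES * 365 / 10**12

print(transactions_per_day)          # 17280000
print(account_changes_per_day)       # 8640000
print(round(bytes_per_transaction))  # 289
print(yearly_transaction_terabytes)  # 1.825
```

Under these assumptions each transaction record has a budget of roughly 290 bytes, and a year of history comes to a little under 2 TB before replication and indexes.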




API design

The API calls will be:

  • completeTransaction(payee_id, payer_id, transaction_time, transaction_amount)- generates a transaction_id, stores the transaction in the database, updates the cache, and sends the transaction to Kafka
  • createAccount(username, name, email, account_status)- generates a user_id, stores the account in the database, and updates the cache
  • updateAccount(user_id, thing being changed)- could use method overloading (i.e. compile-time polymorphism) to offer several variants of the call, each taking the user_id and the one field being changed, so that only that field needs to be updated
  • viewAccountTransactions(user_id, date_range_start, date_range_end)- returns a list of transactions for the user in a given date range
  • deleteAccount(user_id)- removes the account from the system
  • viewAccount(user_id)- returns account details and settings from the cache or database

All of these calls use REST, because its simple synchronous request-response model suits these operations and its statelessness makes the services easier to scale
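
The calls above can be sketched as plain functions over in-memory dictionaries. This is a minimal illustration, not the real service: the class name `InMemoryPaymentApi`, the snake_case method names, the dictionary storage, and the use of `uuid4` for IDs are all assumptions made for the example. Python has no compile-time overloading, so keyword arguments stand in for the overloaded updateAccount variants.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class InMemoryPaymentApi:
    """Hypothetical stand-in for the services behind the REST endpoints."""
    accounts: dict = field(default_factory=dict)
    transactions: dict = field(default_factory=dict)

    def create_account(self, username, name, email, account_status="active"):
        user_id = uuid.uuid4().hex  # assumption: random IDs
        self.accounts[user_id] = {
            "username": username, "name": name,
            "email": email, "account_status": account_status,
        }
        return user_id

    def update_account(self, user_id, **changes):
        # Only the supplied fields are changed; unknown fields are rejected.
        account = self.accounts[user_id]
        unknown = set(changes) - set(account)
        if unknown:
            raise ValueError(f"unknown fields: {sorted(unknown)}")
        account.update(changes)
        return account

    def complete_transaction(self, payee_id, payer_id,
                             transaction_time, transaction_amount):
        if payee_id not in self.accounts or payer_id not in self.accounts:
            raise KeyError("payer and payee must both exist")
        transaction_id = uuid.uuid4().hex
        self.transactions[transaction_id] = {
            "payee_id": payee_id, "payer_id": payer_id,
            "transaction_time": transaction_time,
            "transaction_amount": transaction_amount,
        }
        return transaction_id

    def view_account_transactions(self, user_id, date_range_start, date_range_end):
        # Returns transactions where the user is payer or payee, within the range.
        return [
            t for t in self.transactions.values()
            if user_id in (t["payee_id"], t["payer_id"])
            and date_range_start <= t["transaction_time"] <= date_range_end
        ]

    def view_account(self, user_id):
        return self.accounts[user_id]

    def delete_account(self, user_id):
        del self.accounts[user_id]
```

In a real deployment each method would sit behind an HTTP route and write to the database, cache, and Kafka as described above; the sketch only shows the request and response shapes.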


Database design


Data model- structured data for users and transactions is kept in relational databases to maintain the ACID properties required for financial data; less structured key-value data, such as fraud analysis data and logging data, is kept in non-relational stores


Schema


User

  • Title- USER_DATA
  • primary key- user_id
  • fields- username, name, email, account_status

Transaction

  • Title- TRANSACTION_DATA
  • primary key- transaction_id
  • fields- user_id, transaction_amount, transaction_time, payee_id, payer_id, transaction_status

Fraud Analysis

  • Title- ANALYSIS_DATA
  • primary key- analysis_id
  • fields- transaction_id, risk_score, analysis_details

Logging

  • Title- ALERT_LOG
  • primary key- alert_id
  • fields- user_id, transaction_id, timestamp, type_id
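
The two relational tables above could be declared roughly as follows. This is a sketch under stated assumptions: SQLite (via Python's built-in `sqlite3`) stands in for the actual relational database, the column types are guesses, amounts are assumed to be stored as integers in minor currency units, and the separate user_id column on the transaction table is left out because payer_id and payee_id already reference users.

```python
import sqlite3

# Assumed column types; the schema above only names the fields.
SCHEMA = """
CREATE TABLE user_data (
    user_id        TEXT PRIMARY KEY,
    username       TEXT NOT NULL,
    name           TEXT NOT NULL,
    email          TEXT NOT NULL,
    account_status TEXT NOT NULL
);
CREATE TABLE transaction_data (
    transaction_id     TEXT PRIMARY KEY,
    payer_id           TEXT NOT NULL REFERENCES user_data(user_id),
    payee_id           TEXT NOT NULL REFERENCES user_data(user_id),
    transaction_amount INTEGER NOT NULL CHECK (transaction_amount > 0),
    transaction_time   TEXT NOT NULL,
    transaction_status TEXT NOT NULL
);
"""


def open_database():
    """Create an in-memory database with foreign-key enforcement enabled."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn
```

The foreign keys make the database itself reject a transaction whose payer or payee does not exist, which is one concrete way the relational model helps keep the financial data auditable.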


Partition each table on its primary key, which is the most frequently accessed key, and shard the databases by having each instance handle a discrete range of IDs. Amazon Aurora PostgreSQL would be a good choice for the user and transaction databases, and Amazon DocumentDB (with MongoDB compatibility) for the non-relational databases. These will work with a caching layer, something like ElastiCache or Redis, to reduce read operations on the databases
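
Range-based sharding of this kind comes down to finding which ID range a key falls into. A minimal sketch, assuming numeric IDs and a hypothetical `RangeShardRouter` class whose shard boundaries are chosen purely for illustration:

```python
import bisect


class RangeShardRouter:
    """Maps a numeric ID to the shard that owns its range (illustrative only)."""

    def __init__(self, upper_bounds):
        # upper_bounds[i] is the exclusive upper ID bound of shard i.
        self.upper_bounds = sorted(upper_bounds)

    def shard_for(self, numeric_id):
        index = bisect.bisect_right(self.upper_bounds, numeric_id)
        if numeric_id < 0 or index == len(self.upper_bounds):
            raise ValueError("id outside all shard ranges")
        return index


router = RangeShardRouter([1000, 2000, 3000])
print(router.shard_for(999))   # 0
print(router.shard_for(1000))  # 1
print(router.shard_for(2999))  # 2
```

Binary search keeps the lookup logarithmic in the number of shards, and adding a shard only means appending a new upper bound.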


High-level design


Key Components/Interactions

  • Application Load Balancer- distributes requests across healthy targets and works with auto-scaling groups
  • API Gateway- acts as the entry point for all requests, routing them to the proper services and handling authentication and rate limiting
  • Payment Service- captures transaction data, reading user credentials and session data from the user cache, then writes to the user database and the transaction database and publishes to Kafka. It also converts currencies if needed and decides whether a notification should reach the user, since it already holds the credential and session data needed for that decision
  • Kafka- streams data into the data processing service and into a data warehouse for historical analysis, and delivers data to the alerts service if SageMaker finds fraud
  • Data Processing Service- enriches data with additional context retrieved from a NoSQL database or the transaction cache for better fraud analysis before passing the data on to Kafka
  • Alerts Service- receives alerts from Kafka, logs them, and passes them to the payment service, which determines whether they should be returned to the user
  • SageMaker Model- is fed enriched data from the data processing service and returns analysis results to Kafka
  • Transaction Database- stores transaction records from the payment service, ensuring data integrity and transactional consistency; used for auditing and historical analysis
  • User Database- holds the details and preferences of user accounts and provides the payment service with profile and authentication data
  • Data Warehouse- data from Kafka is stored here as long-term storage for historical analytics, often used in training and updating machine learning models
  • NoSQL Database- provides flexible storage for quick lookup operations during data enrichment
  • Log Database- captures system logs, errors, and alerts from the alerts service for monitoring and troubleshooting
  • User Cache- helps the payment service capture transaction data faster by storing user credentials and session data
  • Transaction Cache- caches transactions for quick retrieval by the data processing service
  • Account Management Service- takes requests from the API Gateway to perform CRUD operations on accounts, updating the cache and database
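
The fraud path through these components can be walked through with in-memory stand-ins. Everything here is hypothetical: the deques stand in for Kafka topics, the `user_context` dictionary for the NoSQL database, the `alert_log` list for the log database, and a fixed "more than ten times the typical amount" rule for the trained model. The topic names are invented for the example.

```python
from collections import defaultdict, deque

topics = defaultdict(deque)                     # stand-in for Kafka topics
user_context = {"u1": {"typical_amount": 50}}   # stand-in for the NoSQL database
alert_log = []                                  # stand-in for the log database


def payment_service(transaction):
    """Publish a captured transaction for downstream analysis."""
    topics["transactions"].append(transaction)


def data_processing_service():
    """Enrich each transaction with extra context before scoring."""
    while topics["transactions"]:
        tx = topics["transactions"].popleft()
        context = user_context.get(tx["payer_id"], {"typical_amount": 0})
        topics["enriched"].append({**tx, **context})


def fraud_model():
    """Placeholder rule, not a trained model: flag amounts over 10x typical."""
    while topics["enriched"]:
        tx = topics["enriched"].popleft()
        risky = (tx["typical_amount"] > 0
                 and tx["transaction_amount"] > 10 * tx["typical_amount"])
        topics["analysis"].append({
            "transaction_id": tx["transaction_id"],
            "risk_score": 1.0 if risky else 0.0,
        })


def alerts_service(threshold=0.5):
    """Log every analysis result whose risk score reaches the threshold."""
    while topics["analysis"]:
        result = topics["analysis"].popleft()
        if result["risk_score"] >= threshold:
            alert_log.append(result)


payment_service({"transaction_id": "t1", "payer_id": "u1", "transaction_amount": 40})
payment_service({"transaction_id": "t2", "payer_id": "u1", "transaction_amount": 900})
data_processing_service()
fraud_model()
alerts_service()
print([alert["transaction_id"] for alert in alert_log])  # ['t2']
```

Because each stage only reads from one topic and writes to another, the stages can be scaled and deployed independently, which is the main reason for putting Kafka between them.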



Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...






Detailed component design


Analysis Database -

  • Overview- will use Amazon DocumentDB (with MongoDB compatibility). Keeping it inside AWS allows for seamless integration with the other AWS components.
  • Scalability- the integration with AWS components allows for auto-scaling and global deployment. It can also be used in combination with edge locations to keep average access time low globally
  • Data Consistency- can use eventual consistency, since this is not the database used for auditing, and relaxing consistency increases speed and availability
  • Data Structures- documents made of key-value pairs allow for rapid read/write access. In addition, MongoDB-style queries and secondary indexes support more complex lookups, which speeds up data enrichment so that alerts are raised as quickly as possible


Load Balancer

  • Overview- will use the AWS Application Load Balancer, which handles SSL/TLS termination and acts as a proxy, offloading HTTPS handling from the back-end services
  • Algorithms- performs regular health checks on the API Gateway instances and routes traffic only to healthy ones; auto-scaling groups replace unhealthy instances and scale in/out as needed
  • Session persistence- can use session stickiness to direct subsequent requests from the same client to the same target, which can also be useful when returning alerts
  • Security- integrates with security groups or firewalls to control things like allowed IP addresses and protocols
  • Logging and Monitoring- can integrate with logging and monitoring tools like CloudWatch to gain insight into traffic patterns and system health and to enable proactive management
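
The health-check behaviour described above can be illustrated with a small round-robin balancer that skips unhealthy targets. This is a sketch of the general idea only, not a description of how the managed load balancer is implemented; the class name `HealthCheckedBalancer` and the `is_healthy` callback are hypothetical, and round robin is chosen just for simplicity.

```python
import itertools


class HealthCheckedBalancer:
    """Round-robin over targets, skipping any that fail the health check."""

    def __init__(self, targets, is_healthy):
        self.targets = list(targets)
        self.is_healthy = is_healthy  # callback: target -> bool (assumed)
        self._cycle = itertools.cycle(self.targets)

    def next_target(self):
        # Try each target at most once per call before giving up.
        for _ in range(len(self.targets)):
            target = next(self._cycle)
            if self.is_healthy(target):
                return target
        raise RuntimeError("no healthy targets")


unhealthy = {"gateway-b"}
balancer = HealthCheckedBalancer(
    ["gateway-a", "gateway-b", "gateway-c"],
    is_healthy=lambda target: target not in unhealthy,
)
print([balancer.next_target() for _ in range(4)])
# ['gateway-a', 'gateway-c', 'gateway-a', 'gateway-c']
```

Once `gateway-b` passes its health check again it rejoins the rotation without any other change.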


Transaction DB

  • Overview- the transaction database uses Amazon Aurora with PostgreSQL. The transactional data and the need for strong ACID guarantees, so that the database is auditable, make SQL a good choice. Amazon Aurora allows for seamless integration with the other AWS components and tools. PostgreSQL is used in this case because it allows individual rows to be retrieved, which makes auditing easier.
  • Scalability- this DB type is optimized for read/write scalability and supports high-throughput transaction processing and real-time financial operations, which is ideal for payment systems.
  • Data Consistency- the relational SQL model strongly supports consistency as well as the other ACID properties
  • Data structures- table data is stored primarily in a heap structure, where each table is a heap and each row is a tuple. Indexes use balanced trees (B-trees) by default, which maintain sorted order and allow fast retrieval of rows by key value
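
The heap-plus-index arrangement can be pictured with a toy table. This is only an illustration of the idea: a Python list stands in for the heap, and a sorted list searched with `bisect` stands in for the balanced-tree index (a real B-tree is a paged tree structure, not a flat array). The class name `HeapTableWithIndex` is hypothetical.

```python
import bisect


class HeapTableWithIndex:
    """Rows live in an unordered heap; a sorted index maps keys to heap slots."""

    def __init__(self):
        self.heap = []             # rows in insertion order
        self.index_keys = []       # keys kept in sorted order
        self.index_positions = []  # heap positions, parallel to index_keys

    def insert(self, key, row):
        slot = bisect.bisect_left(self.index_keys, key)
        if slot < len(self.index_keys) and self.index_keys[slot] == key:
            raise KeyError(f"duplicate key: {key}")
        self.heap.append(row)
        self.index_keys.insert(slot, key)
        self.index_positions.insert(slot, len(self.heap) - 1)

    def get(self, key):
        # Binary search in the index, then one direct heap access.
        slot = bisect.bisect_left(self.index_keys, key)
        if slot == len(self.index_keys) or self.index_keys[slot] != key:
            return None
        return self.heap[self.index_positions[slot]]

    def range(self, low, high):
        # Sorted order makes inclusive range scans a contiguous index slice.
        start = bisect.bisect_left(self.index_keys, low)
        end = bisect.bisect_right(self.index_keys, high)
        return [self.heap[p] for p in self.index_positions[start:end]]
```

The same sorted order that makes single-key lookups fast also makes range scans cheap, which is useful for audit queries over a span of transaction IDs or timestamps.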


Trade offs/Tech choices


ID generation- IDs could be generated by hashing or by a random number generator. Hashing provides a consistent mapping and uses less compute power, but it causes more collisions and needs an additional strategy to handle them. Random number generation requires more compute power, but modern systems generate random numbers efficiently and collisions become far less likely; any collision that does happen can be resolved by regenerating the ID. Some additional work is needed to make sure the generated IDs are spread evenly across the database shards, but that only requires generating within alternating windows of ID ranges.
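
A minimal sketch of the random approach with regeneration on collision, assuming numeric IDs, an in-memory set of already issued IDs, and two illustrative shard ranges. The function names, the range boundaries, and the retry limit are all assumptions made for the example.

```python
import secrets


def generate_id(existing_ids, shard_ranges, shard_index, max_attempts=5):
    """Draw a random ID inside one shard's range, regenerating on collision."""
    low, high = shard_ranges[shard_index]  # high is exclusive
    for _ in range(max_attempts):
        candidate = low + secrets.randbelow(high - low)
        if candidate not in existing_ids:
            existing_ids.add(candidate)
            return candidate
    raise RuntimeError("could not generate a unique ID")


def next_shard(counter, shard_count):
    """Alternate between shard windows so IDs spread evenly across shards."""
    return counter % shard_count


shard_ranges = [(0, 10**9), (10**9, 2 * 10**9)]  # assumed windows
issued = set()
ids = [
    generate_id(issued, shard_ranges, next_shard(i, len(shard_ranges)))
    for i in range(6)
]
print(len(set(ids)))  # 6
```

In a real system the membership check would be a uniqueness constraint in the database rather than an in-memory set, so a collision would surface as a failed insert followed by a retry.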


Load Balancer- Amazon ALB is used for this system because the system benefits from its Layer 7 routing, SSL termination and proxying, and content-based routing to multiple services. In addition, it benefits from integration with the AWS ecosystem.


SQL Databases- an Amazon product was chosen because keeping all components in the AWS ecosystem gives better integration and support. Google Spanner may provide stronger consistency with its TrueTime API, but the benefits of an integrated ecosystem outweigh that gain, especially because Amazon Aurora also strongly supports the ACID properties needed for these databases.


NoSQL DB- DocumentDB was chosen, again for the AWS integration as well as its MongoDB compatibility. Since the data processing service may need to pull data by a number of different fields in complex ways, and needs to do so quickly, the complex query support of MongoDB is ideal for retrieving a mix of data to enrich before sending it to SageMaker. Consistency can also be more relaxed here, so SQL would be unnecessary and the highly tunable consistency of something like Apache Cassandra would be less relevant


Cache- for the cache I would lean toward AWS ElastiCache, again for ease of integration with the rest of the AWS ecosystem. The exception would be regions where regulations require more granular configuration, in which case something like a self-managed Redis deployment may be better


API Gateway vs. Direct service invocation

  • Pros- offers centralized control over incoming requests, easy implementation of cross-cutting concerns (e.g. authentication and rate limiting)
  • Cons- potential single point of failure if not properly managed
  • Decision- API Gateway is used to standardize entry points for security and manageability with measures (like load balancing and auto-scaling) to ensure high availability


Session management Stateless vs Stateful

  • Stateless
    • Pros- easier to scale
    • Cons- does not support conversational paradigms
    • Decision - stateless services are preferred for scaling ease with state information managed in external stores (e.g. cookies and databases if needed)
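
One common way to keep services stateless is to hand the client a signed token that carries the session data, so that any instance can verify it without a shared session store. The sketch below uses HMAC-SHA256 from the Python standard library; the secret key, the token layout, and the function names are assumptions for illustration, and a production system would also need expiry and key rotation.

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"demo-secret-change-me"  # assumption: shared by all service instances


def issue_token(session):
    """Serialize the session and append an HMAC signature."""
    payload = base64.urlsafe_b64encode(
        json.dumps(session, sort_keys=True).encode()
    )
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def read_token(token):
    """Verify the signature and return the session, or raise ValueError."""
    payload, _, signature = token.rpartition(".")
    expected = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("invalid signature")
    return json.loads(base64.urlsafe_b64decode(payload))


token = issue_token({"user_id": "u1", "role": "buyer"})
print(read_token(token))  # {'role': 'buyer', 'user_id': 'u1'}
```

Because the token is self-verifying, a request can land on any instance behind the load balancer, which is what makes the stateless option easier to scale.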



Failure scenarios/bottlenecks

  1. Service Overload
    • Scenario- sudden spikes in traffic overload the system
    • Mitigation- use auto-scaling for API gateway and server instances and ensure load balancers are effectively distributing traffic
  2. Cache Failure
    • Scenario- Caching service becomes unavailable leading to increased load on the database
    • Mitigation- deploy multiple instances of each caching service with replication and automatic fail-over, and let services fall back to the database when a cache node is unavailable
  3. Database Failure
    • Scenario- a database server experiences downtime or crashes, affecting transaction processing and account retrieval
    • Mitigation- use distributed databases with replication and fail-over support. Incorporate read replicas to distribute read load and keep hot standbys for quick fail-over
  4. Network Partition
    • Scenario- a communication failure between data centers or between services affects data consistency
    • Mitigation- design the system to handle network partitions gracefully, deciding for each data store whether to favor consistency or availability during a partition (CAP theorem)
  5. Server Crash
    • Scenario- an individual service instance crashes due to server failure or application bugs
    • Mitigation- deploy multiple instances of each service, making it stateless wherever possible to enable easy instance recovery and load balancing
  6. Data Corruption
    • Scenario- data corruption in the database leads to invalid transaction or account records
    • Mitigation- implement comprehensive backup and restore strategies, write data integrity checks, and use transactions where strong consistency is needed
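
For the cache failure scenario in particular, a service can degrade to reading the database directly when the cache cannot be reached. A minimal sketch, assuming a hypothetical cache client that raises `CacheUnavailable` on failure and a dictionary standing in for the database:

```python
class CacheUnavailable(Exception):
    """Raised by the (hypothetical) cache client when the cache is down."""


class DictCache:
    """In-memory stand-in for a cache client, with a switch to simulate an outage."""

    def __init__(self):
        self.items = {}
        self.available = True

    def get(self, key):
        if not self.available:
            raise CacheUnavailable()
        return self.items.get(key)

    def set(self, key, value):
        if not self.available:
            raise CacheUnavailable()
        self.items[key] = value


class ReadThroughStore:
    """Read from the cache first; fall back to the database on a miss or outage."""

    def __init__(self, cache, database):
        self.cache = cache
        self.database = database

    def get(self, key):
        try:
            value = self.cache.get(key)
            if value is not None:
                return value
        except CacheUnavailable:
            return self.database[key]  # degrade: serve straight from the database
        value = self.database[key]
        try:
            self.cache.set(key, value)  # populate the cache for the next reader
        except CacheUnavailable:
            pass
        return value
```

The fallback keeps requests succeeding during a cache outage, at the cost of extra database load, which is why it is usually paired with the replication and read replicas listed above.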

Bottlenecks

  1. Database write Latency
    • Bottleneck- high write demand during peak transaction volume can slow down the service
    • Solution- use a message queue for write requests to buffer and gradually process incoming requests allowing the system to handle bursts of write operations
  2. Cache Capacity
    • Bottleneck- limited cache size can result in frequent cache misses increasing database load
    • Solution- optimize cache eviction policies and expand cache capacity horizontally by distributing over several nodes
  3. Network Latency
    • Bottleneck- high network latency between services especially if they are geographically dispersed
    • Solution- Use CDNs for regional caching and peer replication across multiple data centers to reduce average latency
  4. API Gateway Load
    • Bottleneck- the API Gateway becomes a bottleneck if it is not sufficiently scalable, since every request passes through it
    • Solution- implement a highly available API Gateway setup with load balancing and fail-over to distribute incoming traffic effectively
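
For the cache capacity bottleneck, least-recently-used (LRU) eviction is one widely used policy. The sketch below is a single-node illustration built on `collections.OrderedDict`; the class name `LruCache` is hypothetical, and LRU is offered as an example policy rather than the one this design has settled on.

```python
from collections import OrderedDict


class LruCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry

    def __contains__(self, key):
        return key in self._items


cache = LruCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" is now the most recently used entry
cache.put("c", 3)   # evicts "b"
print("b" in cache)  # False
```

Distributing the cache over several nodes, as suggested above, raises total capacity, while the eviction policy decides what each node keeps when it fills up.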


Future improvements


Implement CloudFront for lower global latency and better global availability


Machine Learning for Traffic Prediction

  • Improvement- use machine learning to predict traffic bursts and intelligently manage resources
  • Benefit- efficient allocation of system resources based on predictive models can reduce costs and improve service availability under unexpected loads


Increased Interoperability

  • Improvement- integrate with more third-party services (e.g. social media platforms, content management systems) for seamless online payment
  • Benefit- enhances convenience and adoption as users can pay directly within other platforms they are using


Mobile Application Integration

  • Improvement- develop mobile apps with integrated payment and account management features
  • Benefit- provides convenience for users on the go, allowing them to send payments and manage their accounts more effectively