My Solution for Design a Webhook Notification Service
by nectar4678
System requirements
Functional:
- Receive Webhooks: Accept payment event notifications (success, failure, chargeback) from a payment gateway.
- Validate Events: Ensure that events received are from a trusted source (e.g., using signatures or token-based authentication).
- Process Events: Based on event type, trigger specific actions such as updating payment records, notifying users, or initiating refunds.
- Acknowledge Events: Send a response to the payment gateway acknowledging successful receipt and processing of the event.
- Handle Idempotency: Safeguard against duplicate webhook events by implementing idempotency keys.
- Persist Events: Log and store the received events and their statuses (success, failure, etc.) for auditing and troubleshooting.
- Retry Mechanism: Automatically retry processing events in case of transient failures.
- Security: Ensure secure transport of event data using HTTPS and handle authorization.
- Scale: Handle high throughput and guarantee low-latency processing for a large number of events.
- Alerting and Monitoring: Integrate alerting to notify administrators of failures or abnormal processing times.
Non-Functional:
- Scalability: The system must scale to handle spikes in traffic, especially during high-demand periods (e.g., Black Friday sales).
- Availability: The service should aim for high availability (99.9% uptime or higher).
- Performance: The system should aim to process events with low latency (preferably under 100 ms for most events).
- Security: Ensure that data is encrypted both in transit (using HTTPS) and at rest, with stringent authentication measures in place for validating webhooks.
Capacity estimation
Assumptions:
- Peak Traffic: We assume that the system will need to handle around 100,000 webhook events per minute during peak traffic.
- Event Payload Size: The average size of an event payload is around 1 KB.
- Data Retention: We will store event logs for 30 days for audit purposes.
Throughput Estimation:
- Events per minute: 100,000 events per minute (1,666 events per second).
- Data Ingestion Rate:
- 1,666 events/second * 1 KB/event = ~1.63 MB/s of incoming traffic.
- This leads to about 100 MB per minute of data ingestion during peak times.
Storage Estimation:
- Daily Event Storage:
- 100,000 events/min * 60 min/hr * 24 hr/day = 144 million events/day.
- Total storage requirement: 144 million events/day * 1 KB/event = ~137 GB/day.
- Monthly Storage (30 days):
- ~137 GB/day * 30 days = 4.1 TB/month of storage for logs.
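The arithmetic above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the constant names are made up for illustration, and it assumes binary units (1 MB = 1024 KB), which is what the ~1.63 MB/s and ~137 GB/day figures imply.

```python
# Back-of-the-envelope capacity figures from the stated assumptions.
EVENTS_PER_MINUTE = 100_000   # peak rate assumed in the text
PAYLOAD_KB = 1                # average payload size in KB
RETENTION_DAYS = 30           # log retention period

KB_PER_MB = 1024              # binary units (assumption)
KB_PER_GB = 1024 ** 2

events_per_second = EVENTS_PER_MINUTE / 60
ingest_mb_per_second = events_per_second * PAYLOAD_KB / KB_PER_MB
ingest_mb_per_minute = EVENTS_PER_MINUTE * PAYLOAD_KB / KB_PER_MB

events_per_day = EVENTS_PER_MINUTE * 60 * 24
storage_gb_per_day = events_per_day * PAYLOAD_KB / KB_PER_GB
storage_gb_retained = storage_gb_per_day * RETENTION_DAYS

print(f"{events_per_second:,.0f} events/s")
print(f"{ingest_mb_per_second:.2f} MB/s, {ingest_mb_per_minute:.1f} MB/min")
print(f"{events_per_day:,} events/day, {storage_gb_per_day:.0f} GB/day")
print(f"{storage_gb_retained:,.0f} GB retained over {RETENTION_DAYS} days")
```

These numbers assume the peak rate is sustained all day, so they are an upper bound on storage rather than an expected average.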
API design
Receive Webhook Event API
This is the only core API that we are going to focus on.
Endpoint:
POST /api/webhook/events
Request:
{
"event_type": "payment_success",
"event_id": "evt_123456",
"payment_id": "pay_987654",
"amount": 1000,
"currency": "USD",
"status": "success",
"timestamp": "2024-09-15T12:00:00Z",
"signature": "sha256=abcd1234"
}
The other APIs we require are the Acknowledge Event API, the Fetch Event Logs API, and the Retry Failed Event API, all of which are fairly straightforward.
General Considerations for the API Design:
- Idempotency:
- All events must have a unique event_id to ensure that multiple deliveries of the same event are not processed more than once.
- Events should be processed idempotently, meaning repeated calls with the same event_id should not lead to inconsistent state changes.
- Security:
- Each request to the API will be secured using HMAC signatures or a similar mechanism, which allows the receiver to validate that the request came from a trusted payment gateway.
- All communication must be over HTTPS.
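One way to get the idempotency behaviour described above is to record each event_id under a uniqueness constraint and run the side effects only when the insert actually happens. The sketch below is illustrative only: it uses an in-memory SQLite database, and the table name processed_event and the functions open_store and process_once are hypothetical names, not part of the original design.

```python
import sqlite3
from typing import Callable


def open_store() -> sqlite3.Connection:
    """Create an in-memory store that remembers processed event IDs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_event (event_id TEXT PRIMARY KEY)")
    conn.commit()
    return conn


def process_once(conn: sqlite3.Connection, event: dict,
                 handler: Callable[[dict], object]) -> bool:
    """Run handler(event) only the first time this event_id is seen.

    Returns True if the handler ran, False for a duplicate delivery.
    """
    with conn:  # commits on success, rolls back if the handler raises
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_event (event_id) VALUES (?)",
            (event["event_id"],),
        )
        if cur.rowcount == 0:
            return False  # event_id already recorded: skip side effects
        handler(event)
        return True
```

Because the insert and the handler share one transaction, a handler failure rolls the marker back, so a later redelivery or retry of the same event is still processed.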
Database design
Key Entities:
- Event: Represents a webhook event received from the payment gateway.
- Event Log: Stores logs for each event, including its status and timestamps for auditing purposes.
- Retry Log: Tracks retry attempts for events that failed to process initially.
The database design for the Webhook Notification Service will focus on capturing and storing event data, ensuring idempotency, and supporting logging for audit and retry mechanisms.
Table Definitions:
4.1 Event Table
The event table will store information about each webhook event received.
Column | Type | Description |
event_id | VARCHAR | Primary key, unique identifier for the event (UUID format). |
event_type | VARCHAR | Type of event (e.g., payment_success, payment_failure). |
payment_id | VARCHAR | Identifier for the related payment transaction. |
amount | DECIMAL | Amount involved in the transaction. |
currency | VARCHAR | Currency code (e.g., USD). |
status | VARCHAR | Event status (e.g., success, failure, pending). |
signature | VARCHAR | Cryptographic signature for validation. |
created_at | TIMESTAMP | Timestamp when the event was received. |
updated_at | TIMESTAMP | Timestamp when the event status was last updated. |
4.2 Event Log Table
The event_log table will store detailed logs for each event, providing a historical record of actions taken.
Column | Type | Description |
log_id | BIGINT | Primary key, auto-incremented log entry ID. |
event_id | VARCHAR | Foreign key referencing the event table. |
status | VARCHAR | The status at the time of logging (e.g., success, failure). |
message | TEXT | Detailed log message (e.g., error description or action). |
logged_at | TIMESTAMP | Timestamp when the log entry was created. |
4.3 Retry Log Table
The retry_log table will track retries for events that failed to process on the first attempt.
Column | Type | Description |
retry_id | BIGINT | Primary key, auto-incremented log entry ID. |
event_id | VARCHAR | Foreign key referencing the event table. |
retry_count | INT | Number of retry attempts. |
retry_status | VARCHAR | The outcome of the retry attempt (e.g., success, failure). |
retry_at | TIMESTAMP | Timestamp when the retry attempt was made. |
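The three tables above could be created roughly as follows. This is a minimal sketch that uses SQLite through Python's sqlite3 module purely for illustration; the choice of SQLite, the TEXT/NUMERIC type mappings, the NOT NULL constraints, the index names, and the create_database helper are all assumptions rather than part of the original design.

```python
import sqlite3

# Column names follow the table definitions; types are SQLite approximations.
SCHEMA = """
CREATE TABLE event (
    event_id   TEXT PRIMARY KEY,
    event_type TEXT NOT NULL,
    payment_id TEXT NOT NULL,
    amount     NUMERIC NOT NULL,
    currency   TEXT NOT NULL,
    status     TEXT NOT NULL,
    signature  TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE event_log (
    log_id    INTEGER PRIMARY KEY AUTOINCREMENT,
    event_id  TEXT NOT NULL REFERENCES event(event_id),
    status    TEXT NOT NULL,
    message   TEXT,
    logged_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE retry_log (
    retry_id     INTEGER PRIMARY KEY AUTOINCREMENT,
    event_id     TEXT NOT NULL REFERENCES event(event_id),
    retry_count  INTEGER NOT NULL,
    retry_status TEXT NOT NULL,
    retry_at     TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Illustrative indexes for lookups by status and by event.
CREATE INDEX idx_event_status ON event(status);
CREATE INDEX idx_event_log_event_id ON event_log(event_id);
CREATE INDEX idx_retry_log_event_id ON retry_log(event_id);
"""


def create_database(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database and create the event, event_log and retry_log tables."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(SCHEMA)
    return conn
```

With foreign keys enabled, inserting an event_log or retry_log row for an unknown event_id fails with sqlite3.IntegrityError, which is the referential integrity the design asks for.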
4.4 ER Diagram
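Mermaid ER diagram (not reproduced here) of the relationships between the entities (Event, Event Log, Retry Log): each Event can have many Event Log entries and many Retry Log entries, linked through the event_id foreign key.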
Database Design Considerations:
- Indexes:
- The event_id field in the event, event_log, and retry_log tables should be indexed to ensure fast lookup times.
- Additionally, indexing the status field in the event table would help in retrieving events based on their status (e.g., pending, success, failed).
- Foreign Key Constraints:
- The event_id in the event_log and retry_log tables should be a foreign key that references the event table to maintain referential integrity.
- Retention Policy:
- Consider implementing a data retention policy to archive or delete logs older than a specified period (e.g., 30 days) to optimize database storage.
- Scalability:
- For high-volume applications, the database should be capable of horizontal scaling. We can employ sharding strategies based on event_type or event_id to distribute data across multiple database nodes.
- NoSQL databases (e.g., Cassandra) could be considered for scalability if the workload increases dramatically.
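As a rough illustration of sharding by event_id, the sketch below maps an ID to a shard number with a stable hash. The function name shard_for and the use of SHA-256 with a simple modulo are assumptions chosen for brevity, not something the design prescribes.

```python
import hashlib


def shard_for(event_id: str, shard_count: int) -> int:
    """Map an event_id to a shard index in the range [0, shard_count)."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    # A cryptographic hash gives a stable, evenly spread value across processes,
    # unlike Python's built-in hash(), which is randomized per process for strings.
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

Modulo-based placement moves most keys when shard_count changes, so a deployment that expects to add nodes often would likely prefer consistent hashing or a lookup table instead.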
High-level design
The Webhook Notification Service consists of several core components that work together to receive, validate, process, and store webhook events. Here's a breakdown of the components:
Core Components:
- Webhook Receiver: This component acts as the entry point for receiving webhook events from the payment gateway. It handles incoming HTTP requests and forwards the data for processing.
- Event Validator: Validates the incoming webhook events, ensuring that they are authentic (via HMAC signatures) and correctly formatted.
- Event Processor: Processes each event based on its type (e.g., payment success, failure, chargeback), triggering corresponding actions within the application.
- Event Logger: Persists the event and any related logs into the database. It ensures that all received events are recorded for auditing purposes.
- Retry Mechanism: A background worker that retries processing of failed events, based on the event status in the database and logs.
- Monitoring and Alerts: Tracks the performance of the system, ensuring that administrators are alerted in case of failures, abnormal processing times, or other anomalies.
- Database: Stores the event data, logs, and retry information. Supports querying for event status, retries, and history.
Request flows
Detailed Explanation of the Request Flow:
- Event Reception: The process begins when the payment gateway sends a webhook event (e.g., payment success) to the Webhook Receiver.
- Validation: The Event Validator checks the HMAC signature and payload format to ensure the request is authentic and correctly formatted. Invalid events are rejected; valid events proceed to the Event Processor.
- Event Processing: The Event Processor handles the business logic based on the event type, updating records (e.g., marking payments as successful) and triggering actions such as user notifications.
- Persistence: The Event Logger stores the event details and processing status in the database, ensuring a reliable audit trail.
- Retries: If processing fails, the Retry Mechanism retrieves the failed event and reattempts processing. After multiple failures, alerts are raised for investigation.
- Monitoring: The Monitoring & Alerts system tracks the system's health, raising notifications if there are issues like repeated failures or slow processing times.
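The Event Processing step in this flow could be sketched as a dispatch table keyed by event_type. Everything here is hypothetical: the handles decorator, the handler functions, the "chargeback" type name, and the strings they return stand in for real business logic.

```python
from typing import Callable, Dict

Handler = Callable[[dict], str]
HANDLERS: Dict[str, Handler] = {}


def handles(event_type: str) -> Callable[[Handler], Handler]:
    """Register a function as the handler for one event type."""
    def register(func: Handler) -> Handler:
        HANDLERS[event_type] = func
        return func
    return register


@handles("payment_success")
def on_payment_success(event: dict) -> str:
    return f"mark {event['payment_id']} as paid"


@handles("payment_failure")
def on_payment_failure(event: dict) -> str:
    return f"mark {event['payment_id']} as failed"


@handles("chargeback")
def on_chargeback(event: dict) -> str:
    return f"open dispute for {event['payment_id']}"


def process_event(event: dict) -> str:
    """Route a validated event to the handler registered for its type."""
    handler = HANDLERS.get(event["event_type"])
    if handler is None:
        raise ValueError(f"unsupported event_type: {event['event_type']!r}")
    return handler(event)
```

Keeping the routing in one table means a new event type only needs a new decorated function, and unknown types fail loudly instead of being silently dropped.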
Detailed component design
Event Validator
The Event Validator ensures that each webhook event is authentic and properly formatted before further processing. This is crucial for security and preventing malicious data from entering the system.
Key Functions:
- HMAC Signature Verification: The system compares the HMAC signature provided by the payment gateway with a computed signature using a shared secret key. This ensures the event was not tampered with in transit.
- Payload Validation: The validator checks the event payload for required fields (e.g., event_id, payment_id, status) and ensures they adhere to expected formats.
- Rejection of Invalid Requests: If the validation fails (due to an invalid signature or missing fields), the event is rejected with a 400 Bad Request response.
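The validation steps above might look like the following sketch. It assumes the gateway signs the raw request body with HMAC-SHA256 and delivers the signature separately from the body (for example in a header) in the form sha256=<hex digest>; that delivery format, the REQUIRED_FIELDS list, and the function names are assumptions for illustration.

```python
import hashlib
import hmac
import json
from typing import Tuple

# Fields the validator insists on (illustrative selection).
REQUIRED_FIELDS = ("event_id", "event_type", "payment_id", "status", "timestamp")


def sign(secret: bytes, body: bytes) -> str:
    """Compute the signature string for a raw request body."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def validate_request(secret: bytes, body: bytes, signature: str) -> Tuple[int, str]:
    """Return (HTTP status, reason) for an incoming webhook request."""
    expected = sign(secret, body)
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of the signature were correct.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return 400, "invalid signature"
    try:
        payload = json.loads(body)
    except ValueError:  # covers malformed JSON and invalid UTF-8
        return 400, "malformed JSON"
    if not isinstance(payload, dict):
        return 400, "payload must be a JSON object"
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        return 400, "missing fields: " + ", ".join(missing)
    return 200, "ok"
```

The signature is checked before the body is parsed so that unauthenticated input never reaches the JSON parser or any later stage.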
Retry Mechanism
The Retry Mechanism ensures that events that fail during processing due to transient errors (e.g., database downtime or network issues) are retried without manual intervention.
Key Functions:
- Fetching Failed Events: The mechanism periodically queries the database for events marked as failed, particularly transient errors, and retries processing them.
- Exponential Backoff: The system uses an exponential backoff strategy, where the interval between retries is multiplied by a fixed factor after each failure (e.g., retry after 1 min, 5 mins, 25 mins).
- Maximum Retry Limit: After a defined number of retries (e.g., 3 attempts), the event is marked as permanently failed, and an alert is raised for manual review.
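The backoff and retry-limit rules above can be captured in one small function. This is a sketch under assumed defaults: the name next_retry_delay, the 60-second base, and the doubling factor are illustrative, and only the limit of 3 attempts comes from the example in the text.

```python
from typing import Optional


def next_retry_delay(attempt: int, base_seconds: float = 60.0,
                     factor: float = 2.0, max_retries: int = 3) -> Optional[float]:
    """Return the wait in seconds before retry number `attempt` (1-based).

    Returns None once the retry limit is exhausted, meaning the event should
    be marked as permanently failed and an alert raised.
    """
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    if attempt > max_retries:
        return None
    # Each retry waits `factor` times longer than the previous one.
    return base_seconds * factor ** (attempt - 1)
```

A retry worker would call this with the retry_count stored for the event and schedule the next attempt that many seconds later. Adding a small random jitter to the delay is a common refinement, so that events which failed together do not all retry at the same instant.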
Component Design Considerations
- Resilience: Both components are designed to handle failures gracefully, ensuring that invalid or transient issues do not disrupt the system.
- Scalability: The system can scale horizontally, adding more instances of the Event Validator and Retry Mechanism workers as traffic increases.
- Extensibility: New validation rules or retry logic can be added in the future without disrupting the existing workflow.
Trade offs/Tech choices
Idempotency Management
Implementing idempotency ensures that duplicate events are processed only once, but it introduces additional storage and processing overhead to track each event_id. Each event must be uniquely identifiable and stored with its status, which can increase the load on the database.
A lighter solution could be to discard duplicate events at the network level, but this sacrifices reliability if duplicate events are legitimate and need to be handled.
Retry Strategy with Exponential Backoff
Using exponential backoff for retries reduces the risk of overwhelming the system with retries during a failure. However, this can delay the resolution of issues, as events might take longer to be retried, especially during high traffic periods.
A constant retry interval would lead to faster processing of retries but could overload the system if failures persist.
Failure scenarios/bottlenecks
- Invalid signatures: Ensure proper HMAC signature verification is in place. Send detailed error responses back to the payment gateway and log the failure for auditing. Monitor the frequency of invalid signatures and alert administrators in case of suspicious activity.
- Processing failures: Log detailed error messages for all failures. Implement exponential backoff with the retry mechanism to avoid overwhelming downstream services. Raise alerts when events fail after the maximum number of retries.
- Traffic spikes: Scale the system horizontally by adding more Webhook Receiver instances behind a load balancer. Use an autoscaling mechanism to handle sudden traffic spikes. Monitor response times and raise alerts when they exceed thresholds.
- Replay attacks: Implement replay protection by validating the timestamp included in the event payload. Reject events older than a predefined threshold (e.g., 5 minutes). Log and monitor for replay attempts and suspicious activity.
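The timestamp check for replay protection could be sketched as below. The function name is_fresh is hypothetical, and the sketch assumes ISO 8601 timestamps like the one in the sample request; it also rejects timestamps too far in the future, which is an added assumption to guard against clock manipulation.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(minutes=5)  # threshold from the example in the text


def is_fresh(timestamp: str, now: datetime, max_age: timedelta = MAX_AGE) -> bool:
    """Return True if the event timestamp is within max_age of `now`.

    `now` must be a timezone-aware datetime.
    """
    # Older Python versions do not accept a trailing "Z" in fromisoformat,
    # so normalise it to an explicit UTC offset first.
    sent_at = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    if sent_at.tzinfo is None:
        sent_at = sent_at.replace(tzinfo=timezone.utc)  # assume UTC if unspecified
    return abs(now - sent_at) <= max_age
```

A receiver would call is_fresh(event["timestamp"], datetime.now(timezone.utc)) after the signature check. The check is only meaningful if the timestamp is covered by the signature, since otherwise an attacker could simply rewrite it.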
Future improvements
- Implement more sophisticated retry strategies such as prioritizing retries based on event type (e.g., prioritizing payment successes over failures) or dynamic backoff intervals based on system load.
- Introduce event partitioning based on attributes like event_type or payment_id to distribute load more effectively across multiple processing nodes or database partitions.
- Allow custom rate limits per client or payment gateway, ensuring that heavy users are throttled more strictly while still allowing small-scale clients reasonable throughput.