Postgres SQL
Kafka-connect
Change-data-capture
Standalone mode
Kafka topics

Change-data-capture from Postgres SQL to kafka topics using standalone mode Kafka-connect

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Change Data Capture (CDC) is a design pattern that monitors and records the changes in data so that other software can respond to those changes. In the world of databases such as PostgreSQL, CDC can be used to capture insertions, updates, and deletions, and then propagate these changes to other systems like Apache Kafka for further processing, analytics, or data integration.

Using Kafka Connect for Postgres CDC

To capture changes from a PostgreSQL database and stream them to Kafka topics, Kafka Connect with Debezium connectors makes a robust solution. Kafka Connect is an open-source component of Apache Kafka that provides scalable and reliable way to move data between Kafka and other data systems in real or near-real time.

Standalone Mode Kafka Connect

Kafka Connect can be configured in two modes: standalone and distributed. Standalone mode runs Kafka Connect in a single process, whereas distributed mode runs it across multiple machines. For development or small scale environments, standalone mode is typically sufficient.

Setting Up Kafka Connect with Debezium for PostgreSQL

Here’s a step-by-step guide to set up CDC from PostgreSQL to Kafka topics using Kafka Connect in standalone mode:

1. Environment Preparation

Have a PostgreSQL server running with database changes that need to be captured, and an Apache Kafka broker along with Zookeeper setup.

2. Install and Configure Kafka Connect

  1. Install Kafka (if not already installed) which includes Kafka Connect.
  2. Add the Debezium PostgreSQL connector plugin to the Kafka Connect plugins directory.

3. Configure PostgreSQL

Enable logical replication:

  • Set wal_level to logical.
  • Ensure max_replication_slots and max_wal_senders are set to a reasonable value.
sql
1-- Postgres configuration example
2ALTER SYSTEM SET wal_level = logical;
3ALTER SYSTEM SET max_replication_slots = 4;
4ALTER SYSTEM SET max_wal_senders = 4;

4. Create a Connector Configuration

Create a JSON file (postgres-connector.json) that defines the Debezium PostgreSQL connector configuration:

json
1{
2    "name": "postgres-connector",
3    "config": {
4        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
5        "plugin.name": "pgoutput",
6        "tasks.max": "1",
7        "database.hostname": "localhost",
8        "database.port": "5432",
9        "database.user": "postgres-user",
10        "database.password": "postgres-pw",
11        "database.dbname" : "testdb",
12        "database.server.name": "testdb",
13        "table.whitelist": "public.table1",
14        "key.converter": "org.apache.kafka.connect.json.JsonConverter",
15        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
16        "key.converter.schemas.enable": "false",
17        "value.converter.schemas.enable": "false",
18        "publishAll": "false"
19    }
20}

5. Start Kafka Connect in Standalone Mode

Navigate to the Kafka Connect installation directory and start Kafka Connect with the connector configuration file:

bash
./bin/connect-standalone.sh config/connect-standalone.properties config/postgres-connector.json

Monitoring and Managing Connectors

While Kafka Connect is running, you can manage and monitor the connectors using the Kafka Connect REST API. For example, to check the status of a connector:

bash
curl http://localhost:8083/connectors/postgres-connector/status

Summary Table of Key Components and Configurations

ComponentDescription
PostgreSQLSource RDBMS where data changes happen.
Kafka ConnectTool for reliably streaming data between Apache Kafka and other systems.
Debezium ConnectorKafka Connect plugin to capture data change events from PostgreSQL.
Kafka TopicsDestination in Kafka where change events are stored.
ZookeeperCoordination service for Kafka brokers and Kafka Connect.

Conclusion

Using Kafka Connect in standalone mode with a Debezium PostgreSQL connector is an effective way to perform CDC, streaming database changes into Kafka for real-time data processing and monitoring. This guide provides a comprehensive overview to get started with setting up CDC from PostgreSQL using Kafka Connect.

With this architecture, businesses can leverage real-time data streaming to enable more agile data management and analysis, enhancing operational flexibility and insight.


Course illustration
Course illustration

All Rights Reserved.