
What is Change Data Capture (CDC)

Author: Venkata Sudhakar

Change Data Capture (CDC) is a data integration pattern that identifies and captures changes made to data in a source database - inserts, updates, and deletes - and delivers those changes in real time to downstream systems such as data warehouses, caches, search indexes, or message queues. Rather than periodically querying the entire source table for changed records (batch extraction), CDC continuously monitors the database transaction log and streams only the changed rows as they happen.

CDC works by reading the database's write-ahead log (WAL) or transaction log, a low-level sequential record of every change committed to the database. In PostgreSQL this is the WAL, in MySQL the binlog, in Oracle the redo log, and in SQL Server the transaction log. Because the log is written as part of the normal commit process, log-based CDC adds minimal load to the source database and captures every committed change, including deletes. This is fundamentally different from timestamp-based extraction, which requires polling and cannot detect deleted rows.
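On PostgreSQL, the log-based approach can be seen directly using the built-in test_decoding logical decoding plugin. The slot name, table, and connection details below are illustrative, not part of any specific setup:

```shell
# Requires a PostgreSQL instance running with wal_level=logical.
# Slot name, table, and values are illustrative.
psql -c "SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');"

# Make a change in the source table...
psql -c "UPDATE customers SET email = 'new@example.com' WHERE id = 42;"

# ...then read that change back out of the WAL through the slot
psql -c "SELECT data FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);"
```

Tools such as Debezium automate exactly this kind of log reading, but through the pgoutput plugin and with durable offset tracking.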

The most widely adopted open-source CDC tool is Debezium, which runs as a set of Kafka Connect connectors and publishes change events to Apache Kafka topics. Downstream consumers subscribe to these topics and react to changes in near real time. The example below shows the structure of a Debezium change event published to a Kafka topic when a row is updated in a PostgreSQL database.
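As an illustrative sketch (the table, column, and field values here are hypothetical), the payload of a Debezium UPDATE event for a row in a customers table looks roughly like this:

```json
{
  "payload": {
    "before": { "id": 1001, "email": "old@example.com" },
    "after":  { "id": 1001, "email": "new@example.com" },
    "source": {
      "connector": "postgresql",
      "db": "inventory",
      "table": "customers",
      "lsn": 33842896
    },
    "op": "u",
    "ts_ms": 1620000000000
  }
}
```

The op field identifies the operation ("c" for create, "u" for update, "d" for delete), and ts_ms records when the connector processed the event.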


Each operation type carries the following before/after structure:

INSERT event:  before=null,          after={new row data}
UPDATE event:  before={old values},  after={new values}
DELETE event:  before={deleted row}, after=null

The example below shows how to set up a Debezium PostgreSQL connector: create the connector configuration and submit it to Kafka Connect via the REST API.
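A minimal sketch of the registration step follows. The hostname, credentials, and database name are placeholders for your environment; the connector name, topic prefix, and table list match the output shown later in this article. Note that topic.prefix is the Debezium 2.x property name (older 1.x releases used database.server.name):

```shell
# Hypothetical connector configuration -- host, credentials, and
# database name are placeholders, not a real environment.
cat > register-postgres.json <<'EOF'
{
  "name": "postgres-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.dbname": "inventory",
    "topic.prefix": "dbserver1",
    "table.include.list": "public.customers,public.orders",
    "plugin.name": "pgoutput"
  }
}
EOF

# Submit the configuration to the Kafka Connect REST API
curl -i -X POST -H "Content-Type: application/json" \
  --data @register-postgres.json \
  http://localhost:8083/connectors
```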


A successful request returns output similar to the following:

HTTP/1.1 201 Created
{
  "name": "postgres-cdc-connector",
  "config": { ... },
  "tasks": [{ "connector": "postgres-cdc-connector", "task": 0 }],
  "type": "source"
}

# Verify connector is running
curl http://localhost:8083/connectors/postgres-cdc-connector/status
{
  "name": "postgres-cdc-connector",
  "connector": { "state": "RUNNING", "worker_id": "kafka-connect:8083" },
  "tasks": [{ "id": 0, "state": "RUNNING", "worker_id": "kafka-connect:8083" }]
}

# Check Kafka topics created by Debezium
kafka-topics.sh --list --bootstrap-server localhost:9092
dbserver1.public.customers
dbserver1.public.orders

CDC Use Cases in Enterprise Data Migration:

Database replication - Keep a read replica or disaster recovery database in sync with the primary with sub-second latency.

Legacy modernisation - During a migration from an old system to a new system, run both in parallel and use CDC to keep the new system in sync until the cutover is complete.

Data warehouse loading - Stream changes from OLTP databases into BigQuery, Snowflake, or Redshift in near real time instead of running nightly batch ETL jobs.

Cache invalidation - Automatically invalidate or refresh Redis/Memcached entries whenever the source database row changes.

Event sourcing - Use the CDC stream as an event log to trigger downstream microservices whenever data changes in the source system.
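As one hedged sketch of the cache-invalidation pattern above (kcat, jq, and the customer:&lt;id&gt; key scheme are assumptions, not part of the original setup), a small consumer can delete the matching Redis entry whenever a change event arrives on the customers topic:

```shell
# Hypothetical sketch: tail Debezium change events for the customers
# table and drop the matching Redis cache entry. Assumes kcat, jq, and
# redis-cli are installed, and a "customer:<id>" cache-key convention.
kcat -b localhost:9092 -t dbserver1.public.customers -o end -u |
jq -r '.payload.after.id // .payload.before.id // empty' |
while read -r id; do
  redis-cli DEL "customer:$id"
done
```

Using before.id as the fallback covers DELETE events, where the after field is null.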
