What is Change Data Capture?
In this digital age of data, if you work with data day in and day out, you have almost certainly come across the term “Change Data Capture” (CDC). The concept becomes especially important as the amount of data you need to process grows.
Change data capture (CDC) is the process of capturing changes made at the data source and applying them throughout the enterprise. CDC minimizes the resources required for ETL (extract, transform, load) processes because it only deals with data changes. The goal of CDC is to keep data synchronized across systems.
Methods of Change Data Capture:
Listed below are four common methods of performing CDC, along with some of their challenges:
Timestamp-based CDC:
Many transactional applications keep track of metadata in every row, including who created and/or most recently modified the row, as well as when the row was created and last modified. The approach to CDC in such an environment is to record when changes were last extracted, and in a subsequent run filter on the DATE_MODIFIED column to retrieve only the rows modified since the most recent extraction.
Challenges with Timestamp-based CDC are:
- Data deletes are a challenge because a deleted row has no DATE_MODIFIED value (unless deletes are logical, i.e. an update sets a flag in the row indicating that the row was deleted).
- DATE_MODIFIED must be available on all tables and must be reliably set. Database triggers may be a good way to set the values, but these may introduce overhead on the transactional application.
- Extracting the changes is resource intensive. DATE_MODIFIED may be indexed to lower the impact of the select statement, at the cost of storing (and continuously updating) the additional index.
Using DATE_MODIFIED for CDC works well for traditional data warehouse applications that are populated using ETL jobs, provided the source tables don’t process deletes.
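The extraction step above can be sketched in a few lines. This is a minimal illustration using SQLite; the table and column names (`orders`, `date_modified`) are made up, and a real pipeline would persist the high-water mark between runs:

```python
import sqlite3

# Hypothetical source table with a DATE_MODIFIED-style column (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        amount REAL,
        date_modified TEXT  -- ISO-8601 timestamp, sortable as a string
    )
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01T09:00:00"),
     (2, 20.0, "2024-01-02T12:30:00"),
     (3, 30.0, "2024-01-03T08:15:00")],
)

def extract_changes(conn, last_extracted):
    """Return rows modified since the previous extraction run."""
    cur = conn.execute(
        "SELECT id, amount, date_modified FROM orders "
        "WHERE date_modified > ? ORDER BY date_modified",
        (last_extracted,),
    )
    return cur.fetchall()

# Filter on the high-water mark of the previous run; note that a deleted
# row would simply never appear in this result set.
changes = extract_changes(conn, "2024-01-01T23:59:59")
# changes -> rows with id 2 and 3
```

The new high-water mark for the next run would be the largest `date_modified` value seen in this batch.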
Trigger-based CDC:
Database triggers can be used to perform CDC by writing changes to shadow tables. A shadow table may store the entire row to keep track of every single column change, or store only the primary key along with the operation type (insert, update or delete).
Challenges with Trigger-based CDC are:
- Firing the trigger, and storing the row changes in a shadow table, introduces overhead. In an extreme case CDC may introduce 100% overhead on the transaction, i.e. a transaction that took 0.1 seconds may take 0.2 seconds to complete.
- The lower-overhead alternative of storing only the primary key requires a join back to the source table to retrieve the changed data, which (1) increases the load of retrieving the changes, and (2) loses intermediate states if multiple changes took place on the same row.
- If the source application performs a truncate, chances are the trigger won’t fire and the changes are not recorded. Also, when table definitions change, the triggers and shadow tables may have to be modified, recreated and/or recompiled, which adds overhead to managing and maintaining the database.
CDC using database triggers lowers the overhead to extract the changes but increases the overhead to record the changes.
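The shadow-table variant that records only the primary key and operation type can be demonstrated with SQLite triggers. This is a sketch, not a production pattern; the table and trigger names are made up, and real deployments would also capture a timestamp or sequence number per change:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);

    -- Shadow table holding only the key and operation type (the low-overhead variant).
    CREATE TABLE customers_shadow (pk INTEGER, op TEXT);

    CREATE TRIGGER trg_customers_ins AFTER INSERT ON customers
    BEGIN INSERT INTO customers_shadow VALUES (NEW.id, 'I'); END;

    CREATE TRIGGER trg_customers_upd AFTER UPDATE ON customers
    BEGIN INSERT INTO customers_shadow VALUES (NEW.id, 'U'); END;

    CREATE TRIGGER trg_customers_del AFTER DELETE ON customers
    BEGIN INSERT INTO customers_shadow VALUES (OLD.id, 'D'); END;
""")

# Every DML statement now also writes to the shadow table -- this extra
# write per transaction is exactly the overhead described above.
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("UPDATE customers SET name = 'Ada L.' WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 1")

ops = conn.execute("SELECT pk, op FROM customers_shadow").fetchall()
# ops -> [(1, 'I'), (1, 'U'), (1, 'D')]
```

Note that replaying these three shadow rows requires a join back to `customers` for the updated column values, and here the insert and update of row 1 can no longer be distinguished from its final (deleted) state.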
Snapshot Merge CDC:
This is a simple technique where regularly scheduled table exports (“snapshots”) or staging tables are used to identify changed records. By calculating the difference between the current and previous snapshots, all new, updated or deleted records can be captured and loaded into the data warehouse.
Challenges with Snapshot merge CDC are:
- Computing the diff between snapshots is resource intensive, and the resource consumption grows at least linearly with the data volume.
- CDC cannot be performed in real time because the diff is too expensive to run continuously.
Compared to the Timestamp-based CDC method, the snapshot method has no difficulty with deleted rows. The diff method works well for low data volumes.
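The diff step can be sketched as a comparison of two snapshots keyed by primary key. The function and row data below are illustrative; real snapshots would come from table exports or staging tables:

```python
def diff_snapshots(previous, current):
    """Compare two snapshots keyed by primary key and classify each change."""
    inserts = [current[k] for k in current.keys() - previous.keys()]
    deletes = [previous[k] for k in previous.keys() - current.keys()]
    updates = [current[k] for k in current.keys() & previous.keys()
               if current[k] != previous[k]]
    return inserts, updates, deletes

# Illustrative snapshots: {primary_key: row}. Note that deletes ARE
# detectable here (row 1 vanished), unlike with timestamp-based CDC.
prev = {1: ("Ada", 10), 2: ("Grace", 20)}
curr = {2: ("Grace", 25), 3: ("Edsger", 30)}
inserts, updates, deletes = diff_snapshots(prev, curr)
# inserts -> [("Edsger", 30)], updates -> [("Grace", 25)], deletes -> [("Ada", 10)]
```

Every run must hold and compare both full snapshots, which is why the cost grows with the table size rather than with the number of changes.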
Log Scraping CDC:
Database applications can be configured to track all activity in log files. For CDC purposes, those application log files can be scanned and parsed (“scraped”) to identify when changes occur and capture those records.
The challenges with log-based CDC are:
- Interpreting the changes in the transaction log is difficult because there are no documented standards for how the changes are stored, and many scenarios must all be considered and tested.
- Database vendors may not provide an interface to the transaction logs – documented or not – and even if there is one it may be relatively slow and/or resource intensive.
- Most databases have been optimized to use only internal identifiers when recovering row changes, which is insufficient for CDC because the changes must be recorded and applied on a different system.
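To make the idea concrete, here is a sketch that parses a made-up, line-oriented change-log format. Real transaction logs are binary and vendor-specific, which is exactly the difficulty described above; the log lines, regex, and `scrape` function are all illustrative assumptions:

```python
import re

# Hypothetical textual change log; real redo/WAL logs are not this friendly.
LOG_LINES = [
    "2024-01-03T08:15:00 INSERT orders id=3 amount=30.0",
    "2024-01-03T08:16:10 UPDATE orders id=2 amount=25.0",
    "2024-01-03T08:17:45 DELETE orders id=1",
]

PATTERN = re.compile(
    r"^(?P<ts>\S+) (?P<op>INSERT|UPDATE|DELETE) (?P<table>\w+) ?(?P<fields>.*)$"
)

def scrape(lines):
    """Parse change events from the log so they can be replayed downstream."""
    events = []
    for line in lines:
        m = PATTERN.match(line)
        if m:
            fields = dict(f.split("=", 1) for f in m["fields"].split())
            events.append((m["op"], m["table"], fields))
    return events

events = scrape(LOG_LINES)
# events[0] -> ('INSERT', 'orders', {'id': '3', 'amount': '30.0'})
```

Because the changes are read from the log after the fact, this approach adds no overhead to the transaction itself, and deletes are captured as ordinary events, unlike in the timestamp-based method.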