Understanding the principles of CDC (Change Data Capture) in one article
Introduction to CDC (Change Data Capture)
Change Data Capture (CDC) is a technique for tracking row-level changes in a database (inserts, updates, and deletes) and notifying other systems of those changes in the order in which they occurred. In disaster recovery scenarios, CDC primarily synchronizes data between a primary database and its backup, keeping the secondary database up to date in real time.
source ----------> CDC ----------> sink
Apache SeaTunnel CDC
SeaTunnel CDC provides two types of data synchronization:
- Snapshot read: reads the historical data from a table.
- Incremental tracking: reads the incremental change log of a table.
Lock-free snapshot synchronization
The lock-free nature of the snapshot phase is emphasized because many existing CDC platforms, such as Debezium, may lock tables while synchronizing historical data. Snapshot reading is the process of synchronizing the historical data of the database. The basic flow of this process is as follows:
storage -------------> splitEnumerator ---------- split ----------> reader
^ |
| |
\----------------- report -----------/
Split partitioning
The splitEnumerator (the split distributor) divides the table data into multiple splits based on a specified field (such as the table ID or a unique key) and a defined step size.
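The partitioning step can be illustrated with a minimal sketch. It assumes an integer split key whose minimum and maximum are already known; the names SnapshotSplit and make_splits are hypothetical and are not SeaTunnel APIs. The first and last splits are left open-ended so that rows outside the sampled key range are still covered.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SnapshotSplit:
    """Hypothetical split descriptor, loosely modelled on the fields in this article."""
    split_id: str
    table_id: str
    split_start: Optional[int]  # inclusive lower bound; None means unbounded
    split_end: Optional[int]    # exclusive upper bound; None means unbounded


def make_splits(table_id: str, min_key: int, max_key: int, step: int) -> List[SnapshotSplit]:
    """Cut the key range [min_key, max_key] into chunks of `step` keys."""
    if step <= 0:
        raise ValueError("step must be positive")
    splits: List[SnapshotSplit] = []
    start: Optional[int] = None  # the first split is open on the left
    boundary = min_key + step
    index = 0
    while boundary <= max_key:
        splits.append(SnapshotSplit(f"{table_id}:{index}", table_id, start, boundary))
        start = boundary
        boundary += step
        index += 1
    # The last split is open on the right.
    splits.append(SnapshotSplit(f"{table_id}:{index}", table_id, start, None))
    return splits
```

For example, make_splits("db.users", 1, 10, 4) yields three splits with the bounds (None, 5), (5, 9), and (9, None).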
Parallel processing
Each split is assigned to a different reader so that splits can be read in parallel. Each reader occupies one connection.
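As a rough illustration of parallel split reading, the sketch below uses a thread pool in which each worker plays the role of one reader. The function read_split is a hypothetical stand-in for a real database query, and the integer ranges stand in for table rows.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def read_split(split: Tuple[int, int]) -> List[int]:
    """Stand-in for a database read: returns the keys in [start, end)."""
    start, end = split
    return list(range(start, end))


def read_all(splits: List[Tuple[int, int]], readers: int) -> List[List[int]]:
    """Read every split using at most `readers` concurrent workers."""
    with ThreadPoolExecutor(max_workers=readers) as pool:
        # pool.map returns results in the same order as the input splits.
        return list(pool.map(read_split, splits))
```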
Event feedback
After it finishes reading a split, each reader reports its progress back to the splitEnumerator. The metadata for a split is as follows:
String splitId # Routing ID
TableId tableId # Table ID
SeatunnelRowType splitKeyType # The type of field used for partitioning
Object splitStart # Start point of the partition
Object splitEnd # End point of the partition
Once a reader receives the split information, it generates the corresponding SQL statements. Before starting, it records the position in the database log that corresponds to the current split. After completing the current split, the reader reports its progress to the splitEnumerator with the following data:
String splitId # Split ID
Offset highWatermark # Log position corresponding to the split, for future validation
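The SQL generation step mentioned above might look like the following sketch. The function build_split_query is hypothetical, the %s placeholder style is an assumption (it matches common MySQL drivers for Python), and the table and key names are interpolated directly, so they must come from trusted configuration.

```python
from typing import Any, List, Optional, Tuple


def build_split_query(table: str, key: str,
                      start: Optional[Any], end: Optional[Any]) -> Tuple[str, List[Any]]:
    """Build a range query for one split: start is inclusive, end is exclusive."""
    clauses: List[str] = []
    params: List[Any] = []
    if start is not None:
        clauses.append(f"{key} >= %s")
        params.append(start)
    if end is not None:
        clauses.append(f"{key} < %s")
        params.append(end)
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    return f"SELECT * FROM {table}{where} ORDER BY {key}", params
```

A split with no lower or upper bound simply omits the corresponding condition, which is how open-ended first and last splits can be expressed.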
Incremental synchronization
The incremental synchronization phase begins after the snapshot phase. In this phase, any changes that occur in the source database are captured and synchronized to the backup database in real time. This phase listens to the database log (for example, the MySQL binlog). Incremental tracking is usually single-threaded to avoid pulling the binlog repeatedly and to reduce the load on the database. Therefore, only one reader is used, occupying one connection.
data log -------------> splitEnumerator ---------- split ----------> reader
^ |
| |
\----------------- report -----------/
In the incremental synchronization phase, all splits and tables from the snapshot phase are merged into a single split. The split metadata during this phase is as follows:
String splitId
Offset startingOffset # The lowest log start position among all splits
Offset endingOffset # Log end position, or "continuous" if ongoing, e.g., in the incremental phase
List tableIds
Map tableWatermarks # Watermark for all splits
List completedSnapshotSplitInfos # Snapshot phase split details
The CompletedSnapshotSplitInfo fields are as follows:
String splitId
TableId tableId
SeatunnelRowType splitKeyType
Object splitStart
Object splitEnd
Offset watermark # Corresponds to the highWatermark in the report
The split in the incremental phase contains the watermarks of all splits from the snapshot phase. The minimum watermark is chosen as the starting point for incremental synchronization.
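The merge into a single incremental split can be sketched as follows. This is an illustration only: log offsets are modelled as plain integers, only a subset of the fields listed above is kept, and merge_into_incremental_split is a hypothetical helper rather than a SeaTunnel function.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CompletedSnapshotSplitInfo:
    split_id: str
    table_id: str
    watermark: int  # assumption: a log offset simplified to an integer


@dataclass(frozen=True)
class IncrementalSplit:
    split_id: str
    starting_offset: int
    table_ids: List[str]
    completed_snapshot_split_infos: List[CompletedSnapshotSplitInfo]


def merge_into_incremental_split(
        completed: List[CompletedSnapshotSplitInfo]) -> IncrementalSplit:
    """Combine all finished snapshot splits into one incremental split."""
    if not completed:
        raise ValueError("at least one completed snapshot split is required")
    # Start from the smallest watermark so that no change is skipped.
    starting_offset = min(info.watermark for info in completed)
    table_ids = sorted({info.table_id for info in completed})
    return IncrementalSplit("incremental-0", starting_offset, table_ids, list(completed))
```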
Exactly-once semantics
Whether in the snapshot read phase or the incremental read phase, the database may keep changing while it is being synchronized. How do we ensure exactly-once delivery?
Snapshot read phase
In the snapshot read phase, a split may be synchronized while changes are taking place, for example an insert of row k3, an update to k2, and a delete of k1. If these concurrent changes are not identified during the read, updates can be lost. SeaTunnel handles this as follows:
- First, check the binlog position (the low watermark) before reading the split.
- Read the data in the range split{start, end}.
- Record the high watermark after reading.
If high = low, the split data did not change during the read. If (high - low) > 0, changes occurred during processing. In that case, SeaTunnel will:
- Cache the split data in memory as an in-memory table.
- Apply the changes from the low watermark to the high watermark in order, using primary keys to replay the operations on the in-memory table.
- Report the high watermark.
insert k3 update k2 delete k1
| | |
v v v
bin log --|---------------------------------------------------|-- log offset
low watermark high watermark
CDC reads: k1 k3 k4
| Replays
v
Real data: k2 k3' k4
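The watermark-and-replay procedure above can be sketched in a few lines. The sketch assumes integer log offsets and assumes the log events passed in already belong to the split's key range; read_split_consistently and the event tuple layout are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (offset, operation, primary key, new value); value is None for deletes
LogEvent = Tuple[int, str, str, Optional[str]]


def read_split_consistently(snapshot_rows: Dict[str, str],
                            log_events: List[LogEvent],
                            low: int, high: int) -> Dict[str, str]:
    """Return the split contents as of the high watermark."""
    table = dict(snapshot_rows)  # in-memory table keyed by primary key
    if high == low:
        return table  # nothing changed while the split was being read
    for offset, op, key, value in sorted(log_events, key=lambda e: e[0]):
        if not (low < offset <= high):
            continue  # outside the window covered by this split read
        if op in ("insert", "update"):
            table[key] = value  # upsert by primary key
        elif op == "delete":
            table.pop(key, None)
    return table
```

Because every operation is applied by primary key, replaying a change that the snapshot query already observed leaves the in-memory table unchanged, which is what makes the replay safe.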
Incremental phase
Before starting the incremental phase, SeaTunnel first validates all the splits from the previous step. Data may have been updated between splits; for example, if new records are inserted between split1 and split2, they can be missed during the snapshot phase. To recover this data between splits, SeaTunnel follows this approach:
- From all split reports, find the smallest watermark and use it as the starting watermark for reading the log.
- For each log entry read, check completedSnapshotSplitInfos to see whether the data has already been processed in any split. If not, it is considered data between splits and must be corrected.
- Once all splits have been validated, the process moves to the full incremental phase.
|------------filter split2-----------------|
|----filter split1------|
data log -|-----------------------|------------------|----------------------------------|- log offset
min watermark split1 watermark split2 watermark max watermark
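One way to express the per-entry check is shown below. This is a sketch under stated assumptions: offsets and keys are integers, split ranges are half-open, and should_emit is a hypothetical helper. An entry is skipped only when its key falls inside a split whose watermark already covers the entry's offset.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CompletedSnapshotSplitInfo:
    split_id: str
    split_start: Optional[int]  # inclusive; None means unbounded
    split_end: Optional[int]    # exclusive; None means unbounded
    watermark: int              # high watermark reported for the split


def should_emit(key: int, offset: int,
                splits: List[CompletedSnapshotSplitInfo],
                max_watermark: int) -> bool:
    """Decide whether a log entry must be forwarded downstream."""
    if offset > max_watermark:
        return True  # past every split watermark: pure incremental phase
    for info in splits:
        above_start = info.split_start is None or key >= info.split_start
        below_end = info.split_end is None or key < info.split_end
        if above_start and below_end:
            # Already reflected in the snapshot if at or before the watermark.
            return offset > info.watermark
    return True  # key belongs to no split: data between splits
```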
Checkpoint and resume
What about pausing and resuming CDC? SeaTunnel uses a distributed snapshot algorithm (Chandy-Lamport):
Suppose the system has two processes, p1 and p2, where p1 has three variables X1 Y1 Z1 and p2 has three variables X2 Y2 Z2. The initial states are as follows:
p1 p2
X1:0 X2:4
Y1:0 Y2:2
Z1:0 Z2:3
At this point, p1 initiates a global snapshot. p1 first records its own process state, then sends a marker to p2.
Before the marker reaches p2, p2 sends a message M to p1.
p1 p2
X1:0 -------marker-------> X2:4
Y1:0 <---------M---------- Y2:2
Z1:0 Z2:3
On receiving the marker, p2 records its state, and p1 receives the message M. Since p1 has already taken its local snapshot, it only needs to record the message M. The final snapshot looks like this:
p1 M p2
X1:0 X2:4
Y1:0 Y2:2
Z1:0 Z2:3
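The two-process exchange above can be reproduced with a small simulation. It is a simplified sketch of the Chandy-Lamport idea for exactly two processes connected by FIFO channels; the Process class and run_example function are hypothetical and do not reflect SeaTunnel's internal classes.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


class Process:
    def __init__(self, name: str, state: Dict[str, int]) -> None:
        self.name = name
        self.state = dict(state)
        self.inbox: deque = deque()                    # FIFO channel from the peer
        self.snapshot: Optional[Dict[str, int]] = None
        self.channel_log: List[str] = []               # in-flight messages recorded
        self.recording = False

    def start_snapshot(self, peer: "Process") -> None:
        """Record local state, then send a marker to the peer."""
        self.snapshot = dict(self.state)
        self.recording = True
        peer.inbox.append(("marker", None))

    def receive(self, peer: "Process") -> None:
        """Handle the next item on the incoming channel."""
        kind, payload = self.inbox.popleft()
        if kind == "marker":
            if self.snapshot is None:
                # First marker seen: record local state and pass the marker on.
                self.snapshot = dict(self.state)
                peer.inbox.append(("marker", None))
            self.recording = False  # the channel state is now complete
        elif self.recording:
            # Message arrived after the local snapshot but before the marker.
            self.channel_log.append(payload)


def run_example() -> Tuple[Process, Process]:
    p1 = Process("p1", {"X1": 0, "Y1": 0, "Z1": 0})
    p2 = Process("p2", {"X2": 4, "Y2": 2, "Z2": 3})
    p1.start_snapshot(p2)            # p1 records its state and sends the marker
    p1.inbox.append(("msg", "M"))    # p2 sends M before the marker reaches it
    p2.receive(p1)                   # p2 gets the marker and records its state
    p1.receive(p2)                   # p1 gets M and records it as in flight
    p1.receive(p2)                   # p1 gets p2's marker and stops recording
    return p1, p2
```

After run_example(), the global snapshot consists of p1.snapshot, p2.snapshot, and the in-flight message held in p1.channel_log, matching the final snapshot shown above.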
In SeaTunnel CDC, markers are sent to all readers, split enumerators, writers, and other nodes, each of which preserves its in-memory state.