
Understanding the principles of CDC (Change Data Capture) in one article

Introduction to CDC (Change Data Capture)

Change Data Capture (CDC) is a technique for tracking row-level changes made by database operations (inserts, updates, deletes) and notifying other systems of those changes in the order the events occurred. In disaster recovery scenarios, CDC primarily synchronizes data between the primary database and its backup, so that the secondary database is kept in sync in real time.

source ----------> CDC ----------> sink
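To make the idea of ordered change events concrete, here is a minimal sketch of a sink applying inserts, updates, and deletes in the order they were captured. The `ChangeEvent` class and `apply_events` function are hypothetical names used only for illustration; they are not part of any CDC tool's API.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ChangeEvent:
    op: str                      # "insert", "update", or "delete"
    key: Any                     # primary key of the affected row
    row: Optional[dict] = None   # new row image; None for deletes


def apply_events(sink: dict, events: list) -> dict:
    """Apply change events to the sink in the order they were captured."""
    for event in events:
        if event.op in ("insert", "update"):
            sink[event.key] = event.row
        elif event.op == "delete":
            sink.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op}")
    return sink


# Hypothetical backup table holding one row before the events arrive.
backup = {1: {"name": "a"}}
events = [
    ChangeEvent("insert", 2, {"name": "b"}),
    ChangeEvent("update", 1, {"name": "a2"}),
    ChangeEvent("delete", 2),
]
apply_events(backup, events)
print(backup)  # {1: {'name': 'a2'}}
```

Order matters here: applying the delete of key 2 before its insert would leave a stale row in the backup.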

Apache SeaTunnel CDC

SeaTunnel CDC provides two types of data synchronization:

  • Snapshot read: reads historical data from a table.
  • Incremental tracking: reads incremental log changes from a table.

Lock-free snapshot synchronization

The lock-free nature of the snapshot phase is emphasized because many existing CDC platforms, such as Debezium, may lock tables while synchronizing historical data. The snapshot read is the process of synchronizing the database's historical data. The basic flow of this process is as follows:

storage -------------> splitEnumerator ---------- split ----------> reader
                            ^                                   |
                            |                                   |
                            \----------------- report -----------/

Split partitioning

The splitEnumerator (split distributor) partitions the table data into multiple splits based on the specified fields (such as the table ID or unique keys) and the defined step size.
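As a rough illustration of range-based partitioning, the sketch below cuts an integer key range into splits of a fixed step size. It assumes integer keys and half-open ranges `[splitStart, splitEnd)`; `make_splits` is a hypothetical helper, and SeaTunnel's actual splitting strategy may differ.

```python
def make_splits(table_id, min_key, max_key, step):
    """Cut the key range [min_key, max_key] into half-open splits of `step` keys."""
    if step <= 0:
        raise ValueError("step must be positive")
    splits = []
    start = min_key
    index = 0
    while start <= max_key:
        end = min(start + step, max_key + 1)  # exclusive upper bound
        splits.append({
            "splitId": f"{table_id}:{index}",
            "tableId": table_id,
            "splitStart": start,
            "splitEnd": end,
        })
        start = end
        index += 1
    return splits


splits = make_splits("db.orders", 1, 10, 4)
for split in splits:
    print(split["splitId"], split["splitStart"], split["splitEnd"])
# db.orders:0 1 5
# db.orders:1 5 9
# db.orders:2 9 11
```

Half-open ranges are used so that adjacent splits never overlap and no key is read twice.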

Parallel processing

Each split is assigned to a different reader for parallel reading. A single reader occupies one connection.

Event feedback

After finishing a split, each reader reports its progress back to the splitEnumerator. The metadata for a split is as follows:

String              splitId         # Routing ID
TableId             tableId         # Table ID
SeatunnelRowType    splitKeyType    # The type of field used for partitioning
Object              splitStart      # Start point of the partition
Object              splitEnd        # End point of the partition

Once the reader receives the split information, it generates the appropriate SQL statements. Before starting, it records the position in the database log that corresponds to the current split. After completing the current split, the reader reports to the splitEnumerator with the following data:

String      splitId         # Split ID
Offset      highWatermark   # Log position corresponding to the split, for future validation
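The assign, read, and report loop can be sketched with a thread pool standing in for the readers. Everything here is an assumption made for illustration: the in-memory `TABLE`, the fixed `LOG_POSITION`, and the `read_split` function are hypothetical, and a real reader would run SQL over its own database connection.

```python
from concurrent.futures import ThreadPoolExecutor

TABLE = {key: f"row-{key}" for key in range(1, 11)}  # hypothetical source table
LOG_POSITION = 500  # hypothetical log offset; fixed because nothing changes here


def read_split(split):
    """Read one split and return its rows together with the report for the enumerator."""
    rows = [TABLE[key] for key in range(split["splitStart"], split["splitEnd"]) if key in TABLE]
    report = {"splitId": split["splitId"], "highWatermark": LOG_POSITION}
    return rows, report


splits = [
    {"splitId": "s0", "splitStart": 1, "splitEnd": 5},
    {"splitId": "s1", "splitStart": 5, "splitEnd": 9},
    {"splitId": "s2", "splitStart": 9, "splitEnd": 11},
]

# One worker per split plays the role of one reader with one connection.
with ThreadPoolExecutor(max_workers=len(splits)) as pool:
    results = list(pool.map(read_split, splits))

reports = [report for _, report in results]
print(reports)
```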

Incremental synchronization

The incremental synchronization phase begins after the snapshot read phase. In this phase, any changes that occur in the source database are captured and synchronized to the backup database in real time. This phase listens to the database log (for example, the MySQL binlog). Incremental tracking is usually single-threaded to avoid pulling the binlog repeatedly and to reduce database load. Therefore, only one reader is used, occupying a single connection.

data log -------------> splitEnumerator ---------- split ----------> reader
                            ^                                   |
                            |                                   |
                            \----------------- report -----------/

In the incremental synchronization phase, all splits and tables from the snapshot phase are combined into a single split. The split metadata during this phase is as follows:

String                              splitId
Offset                              startingOffset                  # The lowest log start position among all splits
Offset                              endingOffset                    # Log end position, or "continuous" if ongoing, e.g., in the incremental phase
List<TableId>                       tableIds
Map<TableId, Offset>                tableWatermarks                 # Watermark for all splits
List<CompletedSnapshotSplitInfo>    completedSnapshotSplitInfos     # Snapshot phase split details

The CompletedSnapshotSplitInfo fields are as follows:

String              splitId
TableId             tableId
SeatunnelRowType    splitKeyType
Object              splitStart
Object              splitEnd
Offset              watermark       # Corresponds to the highWatermark in the report

The split in the incremental phase contains the watermarks of all splits from the snapshot phase. The minimum watermark is chosen as the starting point for incremental synchronization.
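A minimal sketch of how the single incremental split could be assembled from the completed snapshot splits is shown below. It assumes offsets are comparable integers and uses `None` to stand for a "continuous" end offset; `build_incremental_split` and the `"incremental-0"` ID are hypothetical.

```python
def build_incremental_split(completed_infos):
    """Combine all completed snapshot splits into one incremental split."""
    if not completed_infos:
        raise ValueError("at least one completed snapshot split is required")
    # Start from the smallest watermark so that no change is skipped.
    starting_offset = min(info["watermark"] for info in completed_infos)
    return {
        "splitId": "incremental-0",   # hypothetical ID
        "startingOffset": starting_offset,
        "endingOffset": None,         # None stands for "continuous"
        "tableIds": sorted({info["tableId"] for info in completed_infos}),
        "completedSnapshotSplitInfos": list(completed_infos),
    }


infos = [
    {"splitId": "s1", "tableId": "db.orders", "watermark": 150},
    {"splitId": "s2", "tableId": "db.orders", "watermark": 120},
    {"splitId": "s3", "tableId": "db.users", "watermark": 180},
]
incremental = build_incremental_split(infos)
print(incremental["startingOffset"])  # 120
print(incremental["tableIds"])        # ['db.orders', 'db.users']
```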

Exactly-once semantics

Whether in the snapshot read phase or the incremental read phase, the database may still be changing while it is being synchronized. How do we ensure exactly-once delivery?

Snapshot read phase

In the snapshot read phase, a split may be synchronized while changes are happening, such as inserting a row k3, updating k2, and deleting k1. If these changes are not tracked while the split is being read, updates can be lost. SeaTunnel handles this as follows:

  • First, record the binlog position (the low watermark) before reading the split.
  • Read the data in the range split{start, end}.
  • Record the high watermark after reading.

If high = low, the split's data did not change during the read. If (high - low) > 0, changes occurred during processing. In that case, SeaTunnel will:

  • Cache the split data in memory as an in-memory table.
  • Apply the changes from the low watermark to the high watermark in order, using primary keys to replay the operations on the in-memory table.
  • Report the high watermark.
          insert k3      update k2      delete k1
                |               |               |
                v               v               v
 bin log --|---------------------------------------------------|-- log offset
      low watermark                                     high watermark

CDC reads:    k1 k3  k4
                    | Replays
                    v
Real data:    k2 k3' k4
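The watermark check and replay can be sketched as follows. This is an illustration under stated assumptions, not SeaTunnel's implementation: log offsets are plain integers, every log entry carries a full row image, and `read_split_consistently` and its parameters are hypothetical names.

```python
def read_split_consistently(snapshot_rows, log, low, high, split_start, split_end):
    """
    snapshot_rows: dict of key -> row, as returned by the query for the split
    log:           list of (offset, op, key, row) entries sorted by offset
    low, high:     log offsets recorded before and after the read
    Returns the corrected rows and the high watermark to report.
    """
    table = dict(snapshot_rows)  # in-memory table for this split
    if high == low:
        return table, high       # nothing changed during the read
    for offset, op, key, row in log:
        if not (low < offset <= high):
            continue             # outside the window of this read
        if not (split_start <= key < split_end):
            continue             # belongs to another split
        if op == "delete":
            table.pop(key, None)
        else:                    # insert or update: upsert by primary key
            table[key] = row
    return table, high


# The query ran after "insert 3" but before "update 2" and "delete 1".
snapshot_rows = {1: {"v": "a"}, 2: {"v": "b"}, 3: {"v": "c"}, 4: {"v": "d"}}
log = [
    (101, "insert", 3, {"v": "c"}),
    (102, "update", 2, {"v": "b2"}),
    (103, "delete", 1, None),
]
rows, watermark = read_split_consistently(snapshot_rows, log, 100, 103, 1, 5)
print(rows)       # {2: {'v': 'b2'}, 3: {'v': 'c'}, 4: {'v': 'd'}}
print(watermark)  # 103
```

Because every replayed operation is an upsert or a delete keyed by primary key, replaying a change that the query had already seen (the insert of key 3 here) is harmless.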

Incremental phase

Before starting the incremental phase, SeaTunnel first validates all splits from the previous phase. Data may be updated between splits; for example, if new records are inserted between split1 and split2, they can be missed during the snapshot phase. To recover this data between splits, SeaTunnel follows this approach:

  • From all split reports, find the smallest watermark and use it as the starting watermark for reading the log.
  • For each log entry read, check completedSnapshotSplitInfos to see whether the data has already been processed in any split. If not, it is considered data between splits and must be corrected.
  • Once all splits have been validated, the process moves on to the full incremental phase.
    |------------filter split2-----------------|
          |----filter split1------|                   
data log -|-----------------------|------------------|----------------------------------|- log offset
        min watermark      split1 watermark    split2 watermark                    max watermark
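One way to read the filtering rule is sketched below: a log entry is skipped when its key falls inside a completed split and its offset is not past that split's watermark, because the snapshot phase has already accounted for it. This interpretation, the integer offsets, and the `should_emit` helper are assumptions made for illustration.

```python
def should_emit(entry, completed_infos):
    """Return True if the log entry still has to be applied downstream."""
    for info in completed_infos:
        same_table = info["tableId"] == entry["tableId"]
        in_range = info["splitStart"] <= entry["key"] < info["splitEnd"]
        if same_table and in_range:
            # The snapshot phase already covers changes up to the split's watermark.
            return entry["offset"] > info["watermark"]
    return True  # key is not covered by any split: data between splits


infos = [
    {"splitId": "split1", "tableId": "t", "splitStart": 0, "splitEnd": 100, "watermark": 120},
    {"splitId": "split2", "tableId": "t", "splitStart": 100, "splitEnd": 200, "watermark": 150},
]
log = [
    {"offset": 110, "tableId": "t", "key": 5},    # before the minimum watermark
    {"offset": 130, "tableId": "t", "key": 7},    # after split1's watermark
    {"offset": 140, "tableId": "t", "key": 150},  # already covered by split2
    {"offset": 160, "tableId": "t", "key": 150},  # after split2's watermark
]

start = min(info["watermark"] for info in infos)
emitted = [e["offset"] for e in log if e["offset"] > start and should_emit(e, infos)]
print(emitted)  # [130, 160]
```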

Checkpoint and resume

What about pausing and resuming CDC? SeaTunnel uses a distributed snapshot algorithm (Chandy-Lamport):

Suppose the system has two processes, p1 and p2, where p1 has three variables X1, Y1, Z1 and p2 has three variables X2, Y2, Z2. The initial states are as follows:

p1                                  p2
X1:0                                X2:4
Y1:0                                Y2:2
Z1:0                                Z2:3

At this point, p1 initiates a global snapshot. p1 first records its own process state, then sends a marker to p2.

Before the marker reaches p2, p2 sends a message M to p1.

p1                                  p2
X1:0     -------marker------->      X2:4
Y1:0     <---------M----------      Y2:2
Z1:0                                Z2:3

Upon receiving the marker, p2 records its state, and p1 receives the message M. Since p1 has already taken its local snapshot, it only needs to record the message M. The final snapshot looks like this:

p1 M                                p2
X1:0                                X2:4
Y1:0                                Y2:2
Z1:0                                Z2:3
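The walkthrough above can be reproduced with a small two-process simulation. This is a simplified sketch of the marker handling only: each process has a single FIFO inbox, message M is treated as opaque, and the `Process` class is a hypothetical construct rather than SeaTunnel code.

```python
from collections import deque


class Process:
    def __init__(self, name, state):
        self.name = name
        self.state = dict(state)
        self.inbox = deque()          # FIFO channel of incoming messages
        self.snapshot = None          # recorded local state
        self.recorded_messages = []   # in-flight messages captured by the snapshot
        self.recording = False

    def start_snapshot(self, peer):
        """Initiate the snapshot: record local state, then send a marker."""
        self.snapshot = dict(self.state)
        self.recording = True
        peer.inbox.append("MARKER")

    def send(self, peer, message):
        peer.inbox.append(message)

    def receive(self, peer):
        message = self.inbox.popleft()
        if message == "MARKER":
            if self.snapshot is None:
                # First marker seen: record local state and send a marker back.
                self.snapshot = dict(self.state)
                peer.inbox.append("MARKER")
            # The channel the marker arrived on is now fully recorded.
            self.recording = False
        elif self.recording:
            # Message was in flight when the snapshot started.
            self.recorded_messages.append(message)
        return message


p1 = Process("p1", {"X1": 0, "Y1": 0, "Z1": 0})
p2 = Process("p2", {"X2": 4, "Y2": 2, "Z2": 3})

p1.start_snapshot(p2)   # p1 records its state and sends the marker
p2.send(p1, "M")        # p2 sends M before the marker arrives
p2.receive(p1)          # p2 gets the marker and records its state
p1.receive(p2)          # p1 gets M and records it as in flight
p1.receive(p2)          # p1 gets p2's marker and stops recording

print(p1.snapshot, p1.recorded_messages)  # {'X1': 0, 'Y1': 0, 'Z1': 0} ['M']
print(p2.snapshot, p2.recorded_messages)  # {'X2': 4, 'Y2': 2, 'Z2': 3} []
```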

In SeaTunnel CDC, markers are sent to all readers, split enumerators, writers, and other nodes, each of which maintains its own state in memory.
