Journalizing Knowledge Modules (JKM)

Journalizing Knowledge Modules (JKMs) are a crucial component in the Change Data Capture (CDC) process within the ETL pipeline. Unlike other Knowledge Modules such as IKMs and LKMs, JKMs are not directly used in mappings but are instead used in models to initialize and manage the CDC infrastructure. Their primary role is to track changes in source data, allowing for the efficient and accurate extraction of only the changed records for ETL processing.

Overview of Journalizing Knowledge Modules (JKM)

Purpose:

  • JKMs create and configure the infrastructure necessary for Change Data Capture (CDC).
  • CDC is the process of identifying and capturing changes (inserts, updates, deletes) made to the source data over time.
  • The JKM sets up this change-tracking infrastructure, which consists of several components (sketched after this list):
    • Subscribers Table
    • Table of Changes
    • Views on Change Tables
    • Triggers or Log Capture Programs
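
To make these components concrete, here is a minimal sketch of the kind of objects a JKM would generate for a single journalized source table, using SQLite purely for illustration. Every name in it (customers, cdc_subscribers, j_customers, jv_customers, NIGHTLY_ETL) is hypothetical; a real JKM generates its own object names and tracking columns in the source database's dialect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Hypothetical source table to be journalized.
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT,
    city        TEXT
);

-- 1. Subscribers table: which consumers are registered for change data.
CREATE TABLE cdc_subscribers (
    subscriber_name TEXT PRIMARY KEY,
    registered_at   TEXT DEFAULT (datetime('now'))
);

-- 2. Table of changes (journal): one row per captured change, keyed by the
--    source primary key plus a change type and timestamp.
CREATE TABLE j_customers (
    journal_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id INTEGER NOT NULL,            -- key of the changed source row
    change_type TEXT NOT NULL,               -- 'I' = insert, 'U' = update, 'D' = delete
    changed_at  TEXT DEFAULT (datetime('now'))
);

-- 3. View on the change table: joins the journal back to the source so a
--    consumer can read changed rows together with their current values.
CREATE VIEW jv_customers AS
SELECT j.journal_id, j.change_type, j.changed_at, c.*
FROM   j_customers j
LEFT JOIN customers c ON c.customer_id = j.customer_id;
""")

# Register a downstream consumer of the change data.
conn.execute("INSERT INTO cdc_subscribers (subscriber_name) VALUES ('NIGHTLY_ETL')")
print([row[0] for row in conn.execute("SELECT subscriber_name FROM cdc_subscribers")])
```

The fourth component, the triggers (or log reader) that actually fill the change table, is sketched in the next section.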

How JKMs Work

  1. CDC Infrastructure Initialization:
    • JKMs do not directly participate in mappings (like loading or transforming data), but instead, they are used within the model to define how Change Data Capture (CDC) will operate.
    • JKMs initialize the components that will track changes to the data over time. These components ensure that the system can monitor, capture, and store data changes (such as insertions, updates, and deletions).
  2. Components of the CDC Infrastructure:
    • Subscribers Table:
      • This table tracks the subscribers to the change data. Subscribers are the downstream processes or applications that consume the captured changes; registering them lets each consumer keep track of which changes it has already processed.
    • Table of Changes:
      • The change table records every change (insert, update, delete) made to the source data, typically as the changed row's key plus a change indicator and timestamp. It allows the ETL process to extract just the changes rather than reload entire datasets.
    • Views on the Change Table:
      • Views are created on the change table to allow for easier querying and filtering of changes. Views could be set up to filter by change type (insert, update, delete) or by time range.
    • Triggers or Log Capture Programs:
      • Triggers or log capture programs capture changes as they happen in the source system (a trigger-based sketch follows this list). They can be either:
        • Database triggers that fire whenever an insert, update, or delete occurs on the source table.
        • Log-based capture programs that read from database transaction logs to identify changes.
  3. JKM Role in ETL Process:
    • Initialization and Configuration: The JKM is responsible for configuring the CDC infrastructure for a model or datastore.
    • Change Capture: JKMs set up the mechanisms (triggers, log capture) that ensure changes to source data are accurately recorded. This means that only the changed records will be extracted and processed in later steps, improving ETL performance and minimizing unnecessary data processing.
    • Optimizing Data Extraction: By capturing only the changes made to data, JKMs optimize the extraction process. This avoids re-reading unchanged data and keeps the work in each ETL run proportional to the volume of changes rather than the size of the source tables.
  4. JKMs and Models:
    • JKMs are tied to models in the ETL process. A model represents the schema of the source system or datastore.
    • When you define a JKM in a model, you're essentially setting up the change-tracking infrastructure for that specific source or datastore. This setup allows for incremental loading of data based on changes detected by the JKM infrastructure.
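
As a complement to the component list above, the following sketch shows how the trigger component might populate the change table. It uses SQLite triggers and made-up names (orders, j_orders, trg_orders_*); an actual JKM generates equivalent triggers in the source database's own SQL dialect, usually with additional bookkeeping columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Hypothetical source table and its change (journal) table.
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    amount   REAL
);

CREATE TABLE j_orders (
    order_id    INTEGER NOT NULL,
    change_type TEXT NOT NULL,               -- 'I', 'U' or 'D'
    changed_at  TEXT DEFAULT (datetime('now'))
);

-- One trigger per DML operation: each records the key of the affected row
-- and the kind of change, which is all the journal needs to drive CDC.
CREATE TRIGGER trg_orders_ins AFTER INSERT ON orders
BEGIN
    INSERT INTO j_orders (order_id, change_type) VALUES (NEW.order_id, 'I');
END;

CREATE TRIGGER trg_orders_upd AFTER UPDATE ON orders
BEGIN
    INSERT INTO j_orders (order_id, change_type) VALUES (NEW.order_id, 'U');
END;

CREATE TRIGGER trg_orders_del AFTER DELETE ON orders
BEGIN
    INSERT INTO j_orders (order_id, change_type) VALUES (OLD.order_id, 'D');
END;
""")

# Ordinary DML on the source table -- the triggers journal it automatically.
conn.execute("INSERT INTO orders VALUES (1, 100.0)")
conn.execute("INSERT INTO orders VALUES (2, 250.0)")
conn.execute("UPDATE orders SET amount = 120.0 WHERE order_id = 1")
conn.execute("DELETE FROM orders WHERE order_id = 2")

for row in conn.execute("SELECT order_id, change_type FROM j_orders ORDER BY rowid"):
    print(row)   # (1, 'I'), (2, 'I'), (1, 'U'), (2, 'D')
```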

 

Detailed Steps in the JKM Process:

  1. Set Up the CDC Infrastructure:
    • Create Subscribers Table:
      • The JKM initializes a subscribers table that will record which processes or systems need to be notified about changes to the source data.
    • Create Table of Changes:
      • The JKM creates a change table that will hold records of the changes (inserts, updates, and deletes) made to the source data, so that future ETL jobs can process only the modified data.
    • Create Views on the Table of Changes:
      • To make it easier to query changes, the JKM generates views that can filter the changes based on certain criteria (e.g., change type or timestamp).
    • Configure Triggers or Log Capture Programs:
      • The JKM configures database triggers or log capture programs to detect changes in the source system. These triggers or programs record each modification as it occurs, so that only the changed data needs to be extracted and processed later.
      • For example, if a new record is inserted into the source table, the trigger will insert the corresponding change into the change table.
  2. Changes Are Captured:
    • As changes occur in the source system (e.g., new records are added, existing records are modified or deleted), the CDC infrastructure captures these changes and stores them in the change table.
  3. ETL Process (Using LKMs and IKMs):
    • After the changes have been captured by the JKM-built infrastructure, the rest of the ETL process extracts and processes them (a consumption sketch follows this list).
    • LKMs (Loading Knowledge Modules) will extract the changes from the change table and load them into the staging area or the target datastore.
    • IKMs (Integration Knowledge Modules) will then handle the transformation and loading of the data into the final target.
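
To illustrate this hand-off from the JKM-maintained journal to the loading and integration steps, here is a simplified sketch in which only the journaled rows are copied into a staging table and the consumed journal entries are then purged. The table names and the single-subscriber purge are assumptions made for the example; real knowledge modules typically manage consumption windows per subscriber rather than deleting journal rows outright.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A populated source table and its journal, as the JKM-built triggers would
# have left them after some activity on the source system.
conn.executescript("""
CREATE TABLE products     (product_id INTEGER PRIMARY KEY, price REAL);
CREATE TABLE j_products   (product_id INTEGER, change_type TEXT);
CREATE TABLE stg_products (product_id INTEGER, price REAL, change_type TEXT);

INSERT INTO products   VALUES (1, 9.99), (2, 19.99), (3, 4.50);
INSERT INTO j_products VALUES (2, 'U'), (3, 'I');   -- only rows 2 and 3 changed
""")

# "LKM" step: pull only the changed rows (journal joined back to the source)
# into the staging area -- row 1 is untouched, so it is never read.
conn.execute("""
    INSERT INTO stg_products (product_id, price, change_type)
    SELECT p.product_id, p.price, j.change_type
    FROM   j_products j
    JOIN   products   p ON p.product_id = j.product_id
""")

# The "IKM" step would now merge stg_products into the final target; once the
# changes have been consumed, the processed journal entries are purged.
conn.execute("DELETE FROM j_products")

print(list(conn.execute("SELECT * FROM stg_products")))            # the 2 changed rows
print(conn.execute("SELECT COUNT(*) FROM j_products").fetchone())  # (0,)
```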

 

Types of Change Data Capture (CDC) Strategies:

  1. Trigger-Based CDC:
    • In this strategy, database triggers are set up on the source tables. These triggers will automatically fire when a change (insert, update, delete) occurs on the table, and the changes are then logged into the change table.
  2. Log-Based CDC:
    • Log capture programs read the database transaction (redo) logs to identify changes. Because no triggers are added to the source tables, this method adds little overhead to source transactions, making it better suited to systems with high transaction volumes while still capturing changes in near real time (a simplified polling sketch follows this list).
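
Genuine log-based CDC reads the database's transaction (redo) logs through vendor-specific tooling, which cannot be reproduced in a few lines. The sketch below only illustrates the underlying polling pattern: an ordinary table stands in for the transaction log, and a per-subscriber checkpoint records the last sequence number read, so each poll returns only the entries written since the previous one. All names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Stand-in for the database transaction log: an append-only sequence of
-- change records with a monotonically increasing sequence number.
CREATE TABLE tx_log (
    seq         INTEGER PRIMARY KEY AUTOINCREMENT,
    table_name  TEXT,
    row_key     INTEGER,
    change_type TEXT
);

-- Per-subscriber checkpoint: the last log position each consumer has read.
CREATE TABLE log_checkpoint (
    subscriber TEXT PRIMARY KEY,
    last_seq   INTEGER NOT NULL DEFAULT 0
);
INSERT INTO log_checkpoint VALUES ('NIGHTLY_ETL', 0);
""")

def poll_changes(subscriber: str):
    """Return log entries the subscriber has not seen yet and advance its checkpoint."""
    last_seq = conn.execute(
        "SELECT last_seq FROM log_checkpoint WHERE subscriber = ?", (subscriber,)
    ).fetchone()[0]
    rows = conn.execute(
        "SELECT seq, table_name, row_key, change_type FROM tx_log "
        "WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()
    if rows:
        conn.execute(
            "UPDATE log_checkpoint SET last_seq = ? WHERE subscriber = ?",
            (rows[-1][0], subscriber),
        )
    return rows

# Simulate activity landing in the "log", then poll for it.
conn.executemany(
    "INSERT INTO tx_log (table_name, row_key, change_type) VALUES (?, ?, ?)",
    [("customers", 1, "I"), ("customers", 1, "U"), ("orders", 7, "D")],
)
print(poll_changes("NIGHTLY_ETL"))   # the three new entries
print(poll_changes("NIGHTLY_ETL"))   # [] -- nothing new since the last poll
```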

 

JKM Benefits:

  1. Incremental ETL Processing:
    • By tracking only the changes, JKMs help avoid full-table scans and enable incremental ETL processes, which can improve performance by reducing the amount of data being processed.
  2. Real-Time Data Integration:
    • With triggers or log-based capture mechanisms, JKMs enable real-time or near-real-time data integration, ensuring that the target datastore stays up to date with the source.
  3. Efficient Data Tracking:
    • JKMs allow for fine-grained tracking of data changes, making it easier to manage data in scenarios involving Slowly Changing Dimensions (SCDs) or historical tracking.

Conclusion:

Journalizing Knowledge Modules (JKM) play a vital role in implementing Change Data Capture (CDC) in the ETL process. They set up the infrastructure necessary to track data changes in the source system, including creating the change table, views, and triggers or log capture programs. By capturing only the changes made to source data, JKMs improve ETL performance and enable more efficient data processing, especially for real-time and incremental data integration.

 
