SCD Type 2

1. Purpose of SCD Type 2:

  • Track Changes Over Time: SCD Type 2 is primarily used in data warehousing to manage slowly changing dimensions—attributes that change infrequently over time, but require historical tracking.
  • Preserve Historical Data: It ensures that all historical versions of an entity (e.g., a product or customer) are preserved, even when attributes such as the product’s supplier or customer’s address change. This is crucial for maintaining a reliable historical record.

2. Key Components of SCD Type 2 Table:

  • Surrogate Key:
    • A unique, system-generated identifier for each row in the dimension table (often an auto-incremented integer). The surrogate key replaces the natural key from the operational system to avoid issues with data changes in the source system (e.g., changes in business keys).
    • Example: Product_ID (surrogate key) vs. Product_SKU (natural key).
  • Natural Key:
    • This is the actual business key or identifier from the source system, such as a Product_ID or Customer_ID. The natural key remains consistent across versions of the same entity.
    • Example: A product’s natural key could be the SKU code.
  • Attributes to Overwrite:
    • These are the dimension attributes that can change over time. When such attributes change, SCD Type 2 creates a new record to preserve both the old and new versions.
    • Example: Product attributes like supplier, category, price, or size.
  • Attributes that Require New Row:
    • When certain attributes change, instead of updating the existing row, a new row is inserted to preserve historical integrity.
    • Example: A product’s family or supplier might require a new row in the data warehouse if it changes.
  • Starting Timestamp:
    • A field that stores the date when a record first became valid in the data warehouse. It marks the beginning of the record’s validity period.
    • Example: If a product’s supplier changes on January 1st, 2023, the starting timestamp for the new record would be set to this date.
  • Ending Timestamp:
    • This field records the date when the record is no longer valid. For the historical version of a record, it marks the date when the change occurred.
    • Example: The record of the old supplier will have an ending timestamp of December 31st, 2022, once the new version is active.
  • Current Record Flag:
    • This flag (usually 1 or 0) indicates whether the row is the most recent (active) version of the record.
    • A current record is marked with a 1, and the previous historical record is marked with 0.
    • Example: If the current supplier of the product is “Supplier B,” the row representing that supplier would have a current record flag of 1, while the previous supplier, “Supplier A,” would have the flag 0.

3. Example Scenario:

Imagine a product, identified by its Product ID (natural key), which has attributes like name, size, supplier, and family.

  • Initial Data:
    • Product ID: 123
    • Supplier: "Supplier A"
    • Family: "Electronics"
    • Size: "Medium"
    • Starting Date: 01-Jan-2020
    • Ending Date: NULL (still active)
    • Current Flag: 1
  • After Supplier Change:
    • When the supplier changes from "Supplier A" to "Supplier B," a new record is inserted:
      • Product ID: 123 (same natural key)
      • Supplier: "Supplier B"
      • Family: "Electronics"
      • Size: "Medium"
      • Starting Date: 01-Jan-2023
      • Ending Date: NULL (still active)
      • Current Flag: 1
    • The old record for "Supplier A" is now marked as historical:
      • Product ID: 123
      • Supplier: "Supplier A"
      • Family: "Electronics"
      • Size: "Medium"
      • Starting Date: 01-Jan-2020
      • Ending Date: 31-Dec-2022
      • Current Flag: 0

4. Behavior in Data Warehouse:

  • Historical Data Preservation: Every time a dimension attribute changes, a new row is inserted with the updated information, ensuring that both the old and new records are preserved.
  • Tracking Changes: New rows are inserted into the dimension table for every change in attribute values (like supplier or family), and the system uses the surrogate key to track these versions over time.
  • Record Validity:
    • The combination of the starting and ending timestamps helps to track the period during which each record is valid.
    • The current record flag helps to quickly identify which record represents the most recent version of the entity.

5. Benefits of SCD Type 2:

  • Maintains Historical Context: It ensures that users can analyze data as it existed at any point in the past, which is critical for accurate reporting, trend analysis, and decision-making.
  • Accurate and Detailed Reporting: It enables reporting to reflect the state of data at specific points in time, preserving the ability to review data history even as changes occur.
  • Data Integrity and Auditing: By using surrogate keys and timestamps, the data warehouse ensures that data remains consistent and that historical changes are traceable.

 

 

Here’s a step-by-step breakdown of the process described:

  1. Initialization of Product Dimension in Data Warehouse (March 12, 2006):
    • All product records are inserted into the Data Warehouse.
    • A surrogate key is calculated and assigned to each product.
    • A fake ending date of January 1, 2400 is assigned to all records (indicating they are the current records).
    • The current record flag is set to 1 for all records, indicating they are valid and current.
  2. Updates in the Operational System:
    • Product P1: The supplier for product P1 is updated in the operational system.
    • Product P2: The family for product P2 is updated in the operational system.
    • Product P3: The name for product P3 is updated in the operational system.
    • Product P5: A new product, P5, is added to the operational system.

These changes will likely trigger updates or inserts in the Data Warehouse to reflect the modifications and additions in the operational system, ensuring the data remains synchronized.

 

 

Here’s a step-by-step breakdown of the process you described, highlighting the impact on the data warehouse dimension:

1. Update Supplier for Product P1:

  • Action: The supplier information for product P1 is updated in the operational system.
  • Impact on Data Warehouse:
    • A new current record is created with a new Surrogate Key (Surrogate Key 5).
    • The previous record with Surrogate Key 1 is marked as closed (historical) by setting an appropriate ending date and updating the current record flag to 0.

2. Update Family for Product P2:

  • Action: The family information for product P2 is updated in the operational system.
  • Impact on Data Warehouse:
    • A new current record is created with a new Surrogate Key (Surrogate Key 6).
    • The previous record with Surrogate Key 2 is marked as closed (historical), with a new ending date and a current record flag set to 0.

3. Update Name for Product P3:

  • Action: The name information for product P3 is updated in the operational system.
  • Impact on Data Warehouse:
    • The existing record with Surrogate Key 3 is updated to reflect the new name.
    • Since no structural change in attributes occurs (only the name changes), no new record is created.

4. New Product P5 Added:

  • Action: A new product P5 is added to the operational system.
  • Impact on Data Warehouse:
    • A new current record is created with a new Surrogate Key (Surrogate Key 7).
    • This new product is inserted into the Data Warehouse as a current record.

 

5. Creating a Knowledge Module (KM) to Implement the Behavior:

  • Metadata Requirements:
    • The Knowledge Module (KM) must know the following attributes to perform the necessary updates:
      • Surrogate Key: A unique key for each record that will be used to track changes.
      • Natural Key: A unique identifier for the product (e.g., product ID) that remains the same across different versions of a product.
      • Start Date: The date when the record becomes valid.
      • End Date: The date when the record is no longer valid (if applicable).
      • Current Record Flag: A flag indicating whether the record is the current version.
  • Storing Metadata:
    • Oracle Data Integrator (ODI) stores this metadata in the Slowly Changing Dimension Behavior field in the Description tab for each attribute in the model.
  • Accessing Metadata in IKM:
    • While populating a datastore in a mapping, the IKM (Integration Knowledge Module) accesses this metadata using the SCD_xx selectors.
    • These selectors can be used with the getColList() substitution method to retrieve the relevant column information required for the Slowly Changing Dimension (SCD) handling.

 

Summary of Steps:

  1. Update in operational system triggers updates in the Data Warehouse (e.g., new record creation, old record closing, or record updating).
  2. Create a Knowledge Module (KM) that uses metadata (e.g., surrogate key, start date, etc.) to handle SCD behavior.
  3. Use SCD_xx selectors and getColList() substitution method in Oracle Data Integrator to apply the necessary changes.

This structure ensures the Data Warehouse accurately reflects changes in the operational system while maintaining historical data.

 

 

Here’s a step-by-step breakdown of how Oracle Data Integrator (ODI) implements Type 2 Slowly Changing Dimensions (SCD):

1. Drop and Create the Integration Table

  • Action: Drop the existing integration table (if it exists) and create a new one in the staging area.
  • Purpose: The integration table serves as an intermediary storage area where the data will be loaded and processed before being transferred to the target dimension table.

2. Insert Flow Data into the Integration Table

  • Action: Insert the data from the source (flow data) into the integration table.
    • Use mappings that apply only to the natural key (the unique identifier for the record) and the overwrite on change and add row on change attributes.
    • Set the starting timestamp to the current date and the ending timestamp to a constant (often a far future date like January 1, 2400, to signify that the record is active).
  • Purpose: To prepare the data for comparison and update, marking the start of the record’s validity.

3. Recycle Previous Rejected Records

  • Action: Recycle or reprocess any previously rejected records.
  • Purpose: Ensure that previously rejected records are handled and reinserted into the integration table if they meet the criteria.

4. Call the CKM (Check Knowledge Module) for Data Quality Check

  • Action: Call the Check Knowledge Module (CKM) to perform a data quality check on the flow data.
  • Purpose: Ensure the integrity and accuracy of the data before further processing and inserting it into the target.

5. Flag Records for Update (U)

  • Action: Flag records in the integration table as 'U' (update) if:
    • The natural key and the add row on change columns have not changed when compared to the current records in the target.
  • Purpose: Mark records that need to be updated in the target data warehouse for further processing.

6. Update the Target with 'Overwrite on Change' Records

  • Action: Update the target dimension table with the columns flagged for overwrite on change using the data from the integration table.
    • Filter the records using the 'U' flag to identify which records need updating.
  • Purpose: Ensure that only the relevant records in the target are updated, based on the changes detected.

7. Close Old Records

  • Action: For records in the integration table that correspond to existing records in the target (based on the natural key), close the old records by:
    • Setting their current record flag to 0.
    • Setting their ending timestamp to the current date, indicating that these records are no longer active.
  • Purpose: Mark the old records as historical and end their validity in the data warehouse.

8. Insert New Changing Records

  • Action: Insert the new or changed records into the target table with the following settings:
    • Set the current record flag to 1, indicating that these records are active (current).
  • Purpose: Add the new versions of the records that have changed to the target dimension table.

9. Drop the Integration Table

  • Action: After all updates are made to the target dimension table, drop the integration table.
  • Purpose: Clean up the staging area by removing the temporary integration table that was used to process and load the data.

 

Summary of Steps:

  1. Drop and create the integration table.
  2. Insert flow data into the integration table with appropriate flags and timestamps.
  3. Recycle rejected records.
  4. Perform data quality check using the CKM.
  5. Flag records for updates if the natural key and change columns haven't changed.
  6. Update target dimension table with the records flagged for overwrite on change.
  7. Close old records by updating their current record flag and ending timestamp.
  8. Insert new records with the current record flag set to 1.
  9. Drop the integration table after all changes are applied.

 

Additional Considerations:

  • Tuning and Optimization: In some cases, the SQL produced by ODI may require further tuning and optimization to improve performance, especially for large datasets.

 

No comments:

Post a Comment