From technical perspective, you should target the development of a reusable Stored Procedure (ex. PRC_DELTA_ENGINE) that would receive a couple of parameters:
o MANDATORY Parameters:
§ SOURCE_TABLE_NAME
§ TARGET_TABLE_NAME
§ DELTA_TABLE_NAME – the name of the delta table that will contain the changes detected between SOURCE and TARGET tables
§ JOIN_COLUMN_LIST
· Will contain a comma separated list of columns that would be used to join the SOURCE_TABLE_NAME & TARGET_ORBJECT_NAME
· This means that this list of columns should form a PRIMARY KEY for each of the above objects
§ COMPARE_METHOD
· Will specify the method used for compare, and should integrate support for the following compare methods
o “OR”
§ Using this compare method would result in the construction of a WHERE clause to determine the differences, that would look like:
· WHERE S.COL1 <> T.COL1 OR S.COL2 <> T.COL2 OR ….
o “HASH”
§ Would implement the usage of a HASH function in order to determine the differences, resulting in a WHERE clause that would look like this:
· WHERE HASH(S.COL1,S.COL2,S.COL3,…) <> HASH(T.COL1,T.COL2,T.COL3,…)
§ You need to be aware that, depending on the specific HASH function and algorithm implemented in your database engine of choice, there will be a smaller or greater potential of hash collisions -> different values resulting in the same hashed key
§ In practice, in order to accelerate the HASH comparisons, I would strongly recommend that you save the generated HASH(S.COL1,S.COL2,S.COL3,…) in a HASH_KEY column, when loading data in the source table (ex. HISTORY_STG_INVOICE_DTL). This can significantly speed-up the compare operations
§ This would result into a WHERE clause that would look like this:
· WHERE S.HASH_KEY <> HASH(S.COL1,S.COL2,S.COL3,…)
· Regardless of the COMPARE method used, only the COMMON columns will be used for the differences check
o OPTIONAL Parameter:
§ WHERE_CLAUSE
· This parameter will be used to filter the data from both SOURCE_TABLE_NAME and TARGET_TABLE_NAME, in order to reduce the data set on which the Data compare will be performed
o Ex. WHERE STORE_ID IN (‘NY1’,’OC1’,’WA1’) AND PERIOD = 202308
§ EXCEPTION_COLUMN_LIST
· The list of columns that will be excluded from the comparison -> ex. DATE_CREATION,DATE_MODIFICATION,BATCH_ID
§ HASH_COLUMN_NAME
· The name of the HASH column to be used for comparison -> ex. HASH_KEY_INVOICE_DTL
§ FORCE_METADATA_REFRESH
· Will ALLOW the values “Y” OR “N” / NULL
· For performance perspective:
o you should AVOID re-generation of the SQL statement used to detect the differences. In order to generate such statements would involve:
§ Querying various system tables
§ Different forms of WHILE / FOR clauses, string processing, etc.
o It makes sense to save the DELTA compare statement into a table like DELTA_COMPARE_METADATA
· In this way, when the procedure PRC_DELTA_ENGINE would be launched with FORCE_METADATA_REFRESH = NULL, it would just recover the compare statement from DELTA_COMPARE_METADATA table and execute it
- In practice, the primary purpose of the DELTA Engine procedure would be to generate a DELTA table (ex. DELTA_STG_INVOICE_DTL) that would contain:
o ALL columns from TARGET_TABLE_NAME
o A technical column called OPERATION_TYPE, that would mark the type of operation that has been detected for the line:
§ “D” – corresponds to DELETED rows, respectively rows which were found in SOURCE_TABLE_NAME for the specified filters in the WHERE_CLAUSE, but are now MISSING from TARGET_TABLE_NAME, using the same filters
§ “U” – corresponds to UPDATED rows, respectively to COMMON rows betweenSOURCE_TABLE_NAME & TARGET_NAME, on which we have differences on the columns
§ “I” – corresponds to INSERTED rows, respectively to rows which are ONLY found inTARGET_TABLE_NAME
- This means that the DATA compare statement saved into DELTA_COMPARE_METADATA table would LOOK something like this, when using COMPARE_METHOD = “OR”:
CREATE OR REPLACE TABLE DELTA_STG_INVOICE_DTL
AS
SELECT ‘D’ AS OPERATION_TYPE, S.*
FROM SOURCE_TABLE_NAME S
WHERE NOT EXISTS
(
SELECT 1 FROM TARGET_TABLE_NAME T WHERE T.JOIN_COL_1 = S.JOIN_COL_1 AND …
)
UNION ALL
SELECT ‘I’ AS OPERATION_TYPE, T.*
FROM TARGET_TABLE_NAME T
WHERE NOT EXISTS
(
SELECT 1 FROM SOURCE_TABLE_NAME S WHERE T.JOIN_COL_1 = S.JOIN _COL_1 AND …
)
UNION ALL
SELECT ‘U’ AS OPERATION_TYPE, T.*
FROM SOURCE_TABLE_NAME S
INNER JOIN TARGET_TABLE_NAME T ON T.JOIN_COL_1 = S.JOIN _COL_1 AND …
WHERE S.COL1 <> T.COL1 OR S.COL2 <> T.COL2 OR ….
Depending on your particular needs, can make perfect sense to save a full history data from the DELTA table (ex. DELTA_STG_INVOICE_DTL) into a dedicated table like HISTORY_DELTA_STG_INVOICE_DTL, that would additionally contain additional columns like BATCH_ID (the ID of the ETL process run that generated the data)
o In this way you will be able to identify which ETL run generated a particular set of DELTA records, that would be very useful for Audit purposes
Normally, at the end of this second part of the article, we should have a pretty good idea what a DELTA engine should do and how. But now we are reaching the really challenging part, respectively how we can design our ETL flows in order to use these DELTA records. But more on this on the Data Warehouse Incremental Load - Part 3 article.