Data Warehouse Incremental Load - Part 2

by Nicolae Guse 16. September 2023 13:32

Looking at the Delta Engine approach in more detail

 

Some General Considerations

First and foremost, one pre-requisite when talking about Delta Engines is to already have an Atomic Level implementation in place, as we’ve already discussed in the Data Warehouse Atomic Level article

• As soon as you start aggregating data, it becomes much more difficult to push up changes from the underlying transaction tables, as you may end up having a single aggregated line that combines both data that changed and data that didn’t change

o   Therefore, in order to make the Incremental load approach also work for aggregated tables, you would need to have a Link table that would contain the mapping between:

§  The Primary Key from the Detailed Transaction Table (N)

§  The Primary Key of the Aggregated table (1)

o   This type of approach, based on Link tables, would still allow you to use the DELTA tables in order to determine the impacted lines from the Aggregated table that you need to reprocess, as in the sketch below
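To make this more concrete, below is a minimal sketch of such a Link table, together with the query deriving the impacted Aggregated lines from a DELTA table such as DELTA_STG_INVOICE_DTL (introduced later in this article). All table and column names are hypothetical, and the exact data types depend on your database engine:

-- Link table mapping N detailed Invoice lines to 1 aggregated line
CREATE TABLE LINK_INVOICE_DTL_TO_AGG
(
    INVOICE_ID     INTEGER NOT NULL,  -- Primary Key of the Detailed Transaction Table (N side)
    INVOICE_LINE   INTEGER NOT NULL,
    AGG_INVOICE_ID INTEGER NOT NULL,  -- Primary Key of the Aggregated table (1 side)
    PRIMARY KEY (INVOICE_ID, INVOICE_LINE)
);

-- The Aggregated lines impacted by the detected changes, which need to be reprocessed
SELECT DISTINCT L.AGG_INVOICE_ID
FROM DELTA_STG_INVOICE_DTL D
INNER JOIN LINK_INVOICE_DTL_TO_AGG L
    ON L.INVOICE_ID = D.INVOICE_ID
   AND L.INVOICE_LINE = D.INVOICE_LINE;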

• Another thing you will have to consider, BEFORE moving forward with a DELTA engine approach, is HOW MUCH your data changes:

o   A DELTA engine approach is VERY useful when you have a LOW volume of changes compared to the overall number of records for your criteria

o   For example, when talking about Invoicing, that’s a good use case for an Incremental load, as you typically should have:

§  Most inserted records being new invoices

§  Some past invoices receiving updates

§  Almost no deleted records

o   HOWEVER, if you have something like Forecasted Sales, where, in your Source Application, for a Store and a Future Period, data gets FULLY deleted and re-inserted:

§  a DELTA Engine is definitely NOT the best approach

§  going for a more traditional batch load approach, where data gets fully deleted and reloaded for the specified Store and Period, should provide higher performance, as in the sketch below
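For such cases, here is a minimal sketch of the traditional batch approach, using hypothetical table names and :STORE_ID / :PERIOD as parameters passed by your ETL tool:

-- Fully replace the data for the specified Store and Future Period
DELETE FROM FACT_FORECASTED_SALES
WHERE STORE_ID = :STORE_ID AND PERIOD = :PERIOD;

INSERT INTO FACT_FORECASTED_SALES
SELECT *
FROM STG_FORECASTED_SALES
WHERE STORE_ID = :STORE_ID AND PERIOD = :PERIOD;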

• Moving forward, you will have to think of the logic needed to save a history of the data from the Staging tables that was used to load data into your Fact table. Let’s think of the following scenario:

o   Depending on your specific business scenario, having new transactions in an Imported table DOESN’T necessarily mean that the data can be loaded into your Fact table

o   This would mean that you may sometimes need an additional Log table in your Source Application that would tell you that the data is READY to be processed. Without such a Log table available, you may have a hard time determining whether the data coming from the Source Application is complete enough to qualify for loading into the Data Warehouse

§  As an example, you may have an Invoicing process that needs to be completed for a Store on a given day, in order for the Invoice data to reach the final form which qualifies for Data Warehouse ingestion

§  At the end of that Invoicing process in your Source Application ERP, a line would end up being written into a LOG table on the application side that would notify the Data Warehouse that a new version of the Invoice data is ready to be imported

o   Having such a LOG table available would also help you to further optimize the import of your Staging Transaction tables into the Data Warehouse, as it would allow you to reduce the volume of imported data to just the Stores and Days on which the Invoicing process was completed since the last time data was successfully loaded into the Data Warehouse, as in the sketch below
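As an illustration, here is a minimal sketch of such a LOG table and of the import scoping query. All names are hypothetical, :LAST_SUCCESSFUL_LOAD_TS stands for a parameter tracked by your ETL tool, and the exact syntax depends on your source database:

-- LOG table written by the Source Application ERP at the end of the Invoicing process
CREATE TABLE LOG_INVOICE_READY
(
    STORE_ID      VARCHAR(10) NOT NULL,
    INVOICE_DATE  DATE        NOT NULL,
    COMPLETION_TS TIMESTAMP   NOT NULL  -- when the Invoicing process completed for that Store and Day
);

-- Reduce the import scope to just the Stores and Days completed since the last successful load
SELECT STORE_ID, INVOICE_DATE
FROM LOG_INVOICE_READY
WHERE COMPLETION_TS > :LAST_SUCCESSFUL_LOAD_TS;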

o   From a Data Processing point of view, this would mean that, at the end of your FACT_INVOICE table load, you will have to maintain an equivalent HISTORY_STG_INVOICE_DTL table, based on the following logic:

§  Let’s assume that your STG_INVOICE_DTL gives the granularity of your FACT_INVOICE table (ex. INVOICE_ID, INVOICE_LINE)

§  As a result of running a DELTA Engine you will end up with a DELTA_STG_INVOICE_DTL table, which will return the differences between:

·         HISTORY_STG_INVOICE_DTL – contains the Invoice lines that have been successfully integrated into your FACT_INVOICE table

·         CRNT_STG_INVOICE_DTL – contains the data corresponding to the latest data import -> ex. you’ve imported the June data for the 10 Stores which have just completed their Invoicing process for the day

·         If there is a LOG table in your Source Application telling you which Stores have finished their Invoicing process for a given period, that would be super helpful, as you can compile a WHERE clause to be passed to the DELTA Engine to further reduce the scope of the comparison

·         Overall, your DELTA_STG_INVOICE_DTL will contain the list of INSERTED/UPDATED/DELETED records in the Source Application INVOICE_DTL table, since the previous successful load of the Data Warehouse FACT_INVOICE table

§  Once the FACT_INVOICE has been refreshed based on the changes from DELTA_STG_INVOICE_DTL:

·         the aim is to replicate these changes into the HISTORY_STG_INVOICE_DTL

·         At the end of the process, what you have in HISTORY_STG_INVOICE_DTL should replicate what you have in the Source Application INVOICE_DTL table, for the list of Stores and periods which have been processed. A minimal sketch of this replication step follows below
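A minimal sketch of the replication step, assuming the hypothetical names used so far and COL1, COL2 as placeholders for the remaining business columns:

-- Step 1: remove from HISTORY the old version of the DELETED and UPDATED rows
DELETE FROM HISTORY_STG_INVOICE_DTL
WHERE EXISTS
(
    SELECT 1
    FROM DELTA_STG_INVOICE_DTL D
    WHERE D.OPERATION_TYPE IN ('D','U')
      AND D.INVOICE_ID   = HISTORY_STG_INVOICE_DTL.INVOICE_ID
      AND D.INVOICE_LINE = HISTORY_STG_INVOICE_DTL.INVOICE_LINE
);

-- Step 2: re-insert the current version of the INSERTED and UPDATED rows
INSERT INTO HISTORY_STG_INVOICE_DTL (INVOICE_ID, INVOICE_LINE, COL1, COL2)
SELECT D.INVOICE_ID, D.INVOICE_LINE, D.COL1, D.COL2  -- all columns except OPERATION_TYPE
FROM DELTA_STG_INVOICE_DTL D
WHERE D.OPERATION_TYPE IN ('I','U');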

Some Technical Considerations

 

• From a technical perspective, you should target the development of a reusable Stored Procedure (ex. PRC_DELTA_ENGINE) that would receive a number of parameters:

o   MANDATORY Parameters:

§  SOURCE_TABLE_NAME

§  TARGET_TABLE_NAME

§  DELTA_TABLE_NAME – the name of the delta table that will contain the changes detected between SOURCE and TARGET tables

§  JOIN_COLUMN_LIST

·         Will contain a comma-separated list of columns that would be used to join SOURCE_TABLE_NAME & TARGET_TABLE_NAME

·         This means that this list of columns should form a PRIMARY KEY for each of the above objects

§  COMPARE_METHOD

·         Will specify the method used for compare, and should integrate support for the following compare methods

o   “OR”

§  Using this compare method would result in the construction of a WHERE clause to determine the differences, that would look like:

·         WHERE S.COL1 <> T.COL1 OR S.COL2 <> T.COL2 OR ….

o   “HASH”

§  Would implement the usage of a HASH function in order to determine the differences, resulting in a WHERE clause that would look like this:

·         WHERE HASH(S.COL1,S.COL2,S.COL3,…) <> HASH(T.COL1,T.COL2,T.COL3,…)

§  You need to be aware that, depending on the specific HASH function and algorithm implemented in your database engine of choice, there will be a smaller or greater potential of hash collisions -> different values resulting in the same hashed key

§  In practice, in order to accelerate the HASH comparisons, I would strongly recommend that you save the generated HASH(S.COL1,S.COL2,S.COL3,…) in a HASH_KEY column when loading data into the source table (ex. HISTORY_STG_INVOICE_DTL). This can significantly speed up the compare operations

§  This would result in a WHERE clause that compares the persisted hash on the source side against the hash computed on the fly on the target side:

·         WHERE S.HASH_KEY <> HASH(T.COL1,T.COL2,T.COL3,…)

·         Regardless of the COMPARE method used, only the columns COMMON to both tables will be used for the differences check

o   OPTIONAL Parameters:

§  WHERE_CLAUSE

·         This parameter will be used to filter the data from both SOURCE_TABLE_NAME and TARGET_TABLE_NAME, in order to reduce the data set on which the Data compare will be performed

o   Ex. WHERE STORE_ID IN ('NY1','OC1','WA1') AND PERIOD = 202308

§  EXCEPTION_COLUMN_LIST

·         The list of columns that will be excluded from the comparison -> ex. DATE_CREATION,DATE_MODIFICATION,BATCH_ID

§  HASH_COLUMN_NAME

·         The name of the HASH column to be used for comparison -> ex. HASH_KEY_INVOICE_DTL

§  FORCE_METADATA_REFRESH

·         Will ALLOW the values “Y” OR “N” / NULL

·         From a performance perspective:

o   you should AVOID re-generating the SQL statement used to detect the differences on every run. Generating such a statement would involve:

§  Querying various system tables

§  Different forms of WHILE / FOR loops, string processing, etc.

o   It makes sense to save the DELTA compare statement into a table like DELTA_COMPARE_METADATA

·         In this way, when the procedure PRC_DELTA_ENGINE is launched with FORCE_METADATA_REFRESH = NULL, it would just recover the compare statement from the DELTA_COMPARE_METADATA table and execute it. An example call is sketched below
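Putting the parameters together, a call to such a procedure might look like the sketch below. This is only an illustration: the named-parameter syntax follows Oracle/PostgreSQL conventions, and PRC_DELTA_ENGINE itself is something you would have to develop for your database engine of choice:

CALL PRC_DELTA_ENGINE
(
    SOURCE_TABLE_NAME      => 'HISTORY_STG_INVOICE_DTL',
    TARGET_TABLE_NAME      => 'CRNT_STG_INVOICE_DTL',
    DELTA_TABLE_NAME       => 'DELTA_STG_INVOICE_DTL',
    JOIN_COLUMN_LIST       => 'INVOICE_ID,INVOICE_LINE',
    COMPARE_METHOD         => 'HASH',
    WHERE_CLAUSE           => 'STORE_ID IN (''NY1'',''OC1'',''WA1'') AND PERIOD = 202308',
    EXCEPTION_COLUMN_LIST  => 'DATE_CREATION,DATE_MODIFICATION,BATCH_ID',
    HASH_COLUMN_NAME       => 'HASH_KEY_INVOICE_DTL',
    FORCE_METADATA_REFRESH => NULL  -- reuse the compare statement cached in DELTA_COMPARE_METADATA
);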

 

  •  In practice, the primary purpose of the DELTA Engine procedure would be to generate a DELTA table (ex. DELTA_STG_INVOICE_DTL) that would contain:

o   ALL columns from TARGET_TABLE_NAME

o   A technical column called OPERATION_TYPE, that would mark the type of operation that has been detected for the line:

§  “D” – corresponds to DELETED rows, i.e. rows which were found in SOURCE_TABLE_NAME for the filters specified in the WHERE_CLAUSE, but are now MISSING from TARGET_TABLE_NAME, using the same filters

§  “U” – corresponds to UPDATED rows, i.e. COMMON rows between SOURCE_TABLE_NAME & TARGET_TABLE_NAME, on which we have differences on the compared columns

§  “I” – corresponds to INSERTED rows, i.e. rows which are ONLY found in TARGET_TABLE_NAME

• This means that the DATA compare statement saved into the DELTA_COMPARE_METADATA table would look something like this, when using COMPARE_METHOD = “OR”:

CREATE OR REPLACE TABLE DELTA_STG_INVOICE_DTL
AS
SELECT 'D' AS OPERATION_TYPE, S.*
FROM SOURCE_TABLE_NAME S
WHERE NOT EXISTS
(
    SELECT 1 FROM TARGET_TABLE_NAME T WHERE T.JOIN_COL_1 = S.JOIN_COL_1 AND ...
)
UNION ALL
SELECT 'I' AS OPERATION_TYPE, T.*
FROM TARGET_TABLE_NAME T
WHERE NOT EXISTS
(
    SELECT 1 FROM SOURCE_TABLE_NAME S WHERE T.JOIN_COL_1 = S.JOIN_COL_1 AND ...
)
UNION ALL
SELECT 'U' AS OPERATION_TYPE, T.*
FROM SOURCE_TABLE_NAME S
INNER JOIN TARGET_TABLE_NAME T ON T.JOIN_COL_1 = S.JOIN_COL_1 AND ...
WHERE S.COL1 <> T.COL1 OR S.COL2 <> T.COL2 OR ...
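For reference, when using COMPARE_METHOD = “HASH” with a HASH_KEY column persisted on the SOURCE side, only the “U” branch of the statement would change. A sketch, assuming a generic HASH() function (in practice you would use MD5, SHA-2 or your database engine’s native hash function):

SELECT 'U' AS OPERATION_TYPE, T.*
FROM SOURCE_TABLE_NAME S
INNER JOIN TARGET_TABLE_NAME T ON T.JOIN_COL_1 = S.JOIN_COL_1 AND ...
WHERE S.HASH_KEY <> HASH(T.COL1, T.COL2, T.COL3, ...)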

• Depending on your particular needs, it can make perfect sense to save the full history of the data from the DELTA table (ex. DELTA_STG_INVOICE_DTL) into a dedicated table like HISTORY_DELTA_STG_INVOICE_DTL, which would additionally contain columns like BATCH_ID (the ID of the ETL process run that generated the data)

o   In this way you will be able to identify which ETL run generated a particular set of DELTA records, which would be very useful for Audit purposes, as in the sketch below
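A minimal sketch of this audit step, assuming your ETL orchestrator passes the current run ID as a :BATCH_ID parameter:

-- Archive the DELTA records of the current run, stamped with the ETL batch that produced them
INSERT INTO HISTORY_DELTA_STG_INVOICE_DTL
SELECT :BATCH_ID AS BATCH_ID, D.*
FROM DELTA_STG_INVOICE_DTL D;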

 

Normally, at the end of this second part of the article, we should have a pretty good idea of what a DELTA engine should do, and how. But now we are reaching the really challenging part, namely how we can design our ETL flows in order to use these DELTA records. But more on this in the Data Warehouse Incremental Load - Part 3 article.

Tags:

Data Warehouse Best Practices

