Data Warehouse Incremental Load - Part 3

by Nicolae Guse 4. November 2023 13:22

Propagating the changes at Data Warehouse Level

What can I say, time flies and here we are at the 3rd and last part of the article series dedicated to Incremental Load.

The good news is that, thanks to using tools like a Delta Engine, we now know what changes have occurred in the Staging table and need to be propagated to the Data Warehouse Fact table. Now, in theory, things should be quite straightforward:

  • In an IDEAL world, your Staging transaction table is the one which gives the granularity to your fact table

    • This would mean that any records marked with OPERATION_TYPE = ‘D’ in our example DELTA_STG_INVOICE_DTL should be directly deleted from the corresponding FT_INVOICE table

    • And ALL records from the DELTA_STG_INVOICE_DTL with OPERATION_TYPE IN (‘I’,’U’) should go through the regular loading script for the Fact table, which will apply all necessary business rules and, when everything is ready, load them via a MERGE statement into the final FT_INVOICE table (a minimal sketch of both statements follows below), and that’s it

    • Sounds deceptively simple, doesn’t it? 😊
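
To make the IDEAL world flow concrete, here is a minimal sketch of the two statements involved. It assumes that INVOICE_DTL_ID is the column giving the grain of both DELTA_STG_INVOICE_DTL and FT_INVOICE, and that AMOUNT stands in for the measures; both column names are illustrative, not taken from the actual model:

-- 1. Deletions: remove from the Fact table the records marked with 'D' in the Delta table
DELETE FROM FT_INVOICE F
 WHERE EXISTS (SELECT 1
                 FROM DELTA_STG_INVOICE_DTL D
                WHERE D.OPERATION_TYPE = 'D'
                  AND D.INVOICE_DTL_ID = F.INVOICE_DTL_ID);

-- 2. Inserts / updates: merge the 'I' and 'U' records, after the business rules have been applied
MERGE INTO FT_INVOICE F
USING (SELECT * FROM DELTA_STG_INVOICE_DTL WHERE OPERATION_TYPE IN ('I','U')) D
   ON (F.INVOICE_DTL_ID = D.INVOICE_DTL_ID)
 WHEN MATCHED THEN
   UPDATE SET F.AMOUNT = D.AMOUNT
 WHEN NOT MATCHED THEN
   INSERT (INVOICE_DTL_ID, AMOUNT)
   VALUES (D.INVOICE_DTL_ID, D.AMOUNT);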

  • In the REAL world, things are NEVER that simple, and here are some reasons why:  

    • You will always have some joins between our example DELTA_STG_INVOICE_DTL and some dimension tables. And these joins complicate the situation A LOT

    • The scenario that we’ve discussed in the previous articles is that you will ideally have something like a LOG table in the Source application, that will notify you when you need to load the data for a given Store and Period. And these LOG tables are populated when a Cost Controller pushes a button in the application, to send data towards the Data Warehouse

    • The BIG problem is that, in practice, you have little idea WHY the Cost Controller pushed that button:

      • The typical reason can be that new records have been added into the Source application INVOICE_DTL table. If that’s the scenario, then they will end up in our friendly DELTA_STG_INVOICE_DTL, and everybody lives happily ever after 😊

      • HOWEVER, we must remember that the Data Warehouse is part of a feedback loop for the Cost Controllers:

        • They are often using the Data Warehouse as a tool to check if the data is correct in the Source application

        • If they detect something is wrong with the data in the Reporting tool on top of the Data Warehouse, they will go back to the Source application and perform some changes, ranging from adding or correcting transactions to adding new setup / business rules

      • In practice, in this second scenario, you will NOT have any changes recorded in the DELTA_STG_INVOICE_DTL, but the data for the Store & Period will still have to be re-processed, since some setup lines will appear in some of the tables that you’re joining with in the script which loads the FT_INVOICE table.

    • And this is exactly where things become complicated, when talking about Incremental Load:

      • From the user’s perspective, they really don’t care about the Incremental Load. They just want the latest version of the data for their Store and Period processed

      • On your side, as a developer, you’ll have to explore ways to make Incremental Load as reliable as the regular Batch Load, which fully reloads ALL data for a Store & Period. If not, Incremental Load would be a failure, since nobody wants data that arrives faster but is potentially wrong

      • Here are some options we can explore:

        • For each of the tables involved in joins with our DELTA_STG_INVOICE_DTL, we can consider implementing the same DELTA_% approach

        • This would then also mean that you would end up with a wonderful triumvirate of tables for each of the tables involved in a join:

          • Let’s assume that originally the STG_INVOICE_DTL was joining with DIM_ACCOUNT. Implementing the DELTA_% approach will mean that you’ll need to have 3 tables:

            • The regular DIM_ACCOUNT

            • The new HISTORY_DIM_ACCOUNT, which saves the records from DIM_ACCOUNT that were actually used to load the data for a given Store and Period, under a BATCH_ID (each run of the ETL for a Store and Period will generate a single BATCH_ID). Just be aware that you will only need to save the DIM_ACCOUNT records that are used to load your slice of data in the Fact table, and NOT a copy of the entire DIM_ACCOUNT table

            • The new DELTA_DIM_ACCOUNT, which will be populated by our PRC_DELTA_ENGINE, by comparing the latest version of the records from DIM_ACCOUNT needed to process the data for a given Store and Period with the ones from HISTORY_DIM_ACCOUNT, for the same criteria (a hedged sketch of this snapshot-and-compare logic follows below)
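
For illustration only, here is one possible shape for that snapshot-and-compare step. The attribute list (ACCOUNT_CODE, ACCOUNT_NAME), the STORE_ID / PERIOD_ID filters and the :BATCH_ID / :PREVIOUS_BATCH_ID bind variables are assumptions made for this sketch, not the actual PRC_DELTA_ENGINE implementation:

-- 1. Snapshot only the DIM_ACCOUNT slice used for this Store & Period, under the current BATCH_ID
INSERT INTO HISTORY_DIM_ACCOUNT (BATCH_ID, ACCOUNT_ID, ACCOUNT_CODE, ACCOUNT_NAME)
SELECT :BATCH_ID, D.ACCOUNT_ID, D.ACCOUNT_CODE, D.ACCOUNT_NAME
  FROM DIM_ACCOUNT D
 WHERE D.ACCOUNT_ID IN (SELECT S.ACCOUNT_ID
                          FROM STG_INVOICE_DTL S
                         WHERE S.STORE_ID = :STORE_ID
                           AND S.PERIOD_ID = :PERIOD_ID);

-- 2. Keep in DELTA_DIM_ACCOUNT only the accounts that are new or changed versus the previous run
INSERT INTO DELTA_DIM_ACCOUNT (ACCOUNT_ID, ACCOUNT_CODE, ACCOUNT_NAME)
SELECT CUR.ACCOUNT_ID, CUR.ACCOUNT_CODE, CUR.ACCOUNT_NAME
  FROM DIM_ACCOUNT CUR
  LEFT JOIN HISTORY_DIM_ACCOUNT HIST
    ON HIST.ACCOUNT_ID = CUR.ACCOUNT_ID
   AND HIST.BATCH_ID = :PREVIOUS_BATCH_ID
 WHERE CUR.ACCOUNT_ID IN (SELECT S.ACCOUNT_ID
                            FROM STG_INVOICE_DTL S
                           WHERE S.STORE_ID = :STORE_ID
                             AND S.PERIOD_ID = :PERIOD_ID)
   AND (HIST.ACCOUNT_ID IS NULL
     OR HIST.ACCOUNT_CODE <> CUR.ACCOUNT_CODE
     OR HIST.ACCOUNT_NAME <> CUR.ACCOUNT_NAME);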

        • In the REAL world, when the Cost Controller sends the data to the Data Warehouse, you can end up in one of multiple scenarios:

          • Have some data in DELTA_STG_INVOICE_DTL, since some transactions changed, and some data in DELTA_DIM_ACCOUNT

          • Have data in DELTA_STG_INVOICE_DTL and NO data in DELTA_DIM_ACCOUNT

          • Have NO data in DELTA_STG_INVOICE_DTL, and some data in DELTA_DIM_ACCOUNT

        • In practice, to handle these challenges, you can build a view that implements the following logic:

CREATE OR REPLACE VIEW VW_STG_INVOICE_DTL_TO_PROCESS AS

SELECT * FROM DELTA_STG_INVOICE_DTL

UNION

SELECT A.*

FROM STG_INVOICE_DTL A

INNER JOIN DELTA_DIM_ACCOUNT B ON A.ACCOUNT_ID = B.ACCOUNT_ID

        • Once you modify the script that loads the data into FT_INVOICE to use the VW_STG_INVOICE_DTL_TO_PROCESS (as illustrated in the sketch below), you should be good to go for Incremental Load, as the view will contain ALL the transactions needed to process changes that occurred on either STG_INVOICE_DTL or DIM_ACCOUNT
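
As a hedged illustration of that change, the load script below keeps the same join and MERGE pattern, with the view as the new source; INVOICE_DTL_ID, AMOUNT and the ACTIVE_FLAG business rule are hypothetical names used only for this sketch:

MERGE INTO FT_INVOICE F
USING (SELECT S.INVOICE_DTL_ID, S.ACCOUNT_ID, S.AMOUNT
         FROM VW_STG_INVOICE_DTL_TO_PROCESS S
        INNER JOIN DIM_ACCOUNT A
           ON A.ACCOUNT_ID = S.ACCOUNT_ID   -- same join as before, now driven by the view
        WHERE A.ACTIVE_FLAG = 'Y') SRC      -- stand-in for the real business rules
   ON (F.INVOICE_DTL_ID = SRC.INVOICE_DTL_ID)
 WHEN MATCHED THEN
   UPDATE SET F.ACCOUNT_ID = SRC.ACCOUNT_ID, F.AMOUNT = SRC.AMOUNT
 WHEN NOT MATCHED THEN
   INSERT (INVOICE_DTL_ID, ACCOUNT_ID, AMOUNT)
   VALUES (SRC.INVOICE_DTL_ID, SRC.ACCOUNT_ID, SRC.AMOUNT);

Deletions would still be handled separately, from the OPERATION_TYPE = ‘D’ records of DELTA_STG_INVOICE_DTL, as in the first sketch.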

    • Now, this was a relatively simple example:

      • But in practice, in a load script for an FT table, you can have DOZENS of tables involved in joins, either directly with STG_INVOICE_DTL or joins between different DIM tables, as part of a Snowflake data model implementation

      • This means that you may end up needing to build MULTIPLE VW_DIM%_TO_PROCESS objects, to capture all necessary changes. And this can definitely escalate the complexity VERY QUICKLY

  • At the end of the day, it’s all about balance:

    • Just because you can, doesn’t mean you should. Not all scenarios for Incremental Load can be converted easily into viable technical solutions. It’s kind of the same reason why you typically have a collection of tools in the house, and not just one tool to rule them all 😊

    • Let’s not forget that the whole reason for going towards an Incremental Load approach, was to be able to load data SIGNIFICANTLY faster than using the regular Batch Load

    • You must carefully evaluate the scenario on which you want to implement an incremental load. If your load script ends up joining multiple tables, and it’s MANDATORY that ANY change in any of the tables involved is replicated into the data being processed, you may end up in the classical situation where “the cure is more painful than the disease”:

      • Building additional objects needed for the Delta mechanism involves time and increases complexity

      • Running multiple calls of PRC_DELTA_ENGINE in parallel will also add to your CPU usage / running time / costs

    • In order to try to take better advantage of the whole Incremental Load approach, you may need to really re-think some of the existing implementations for your fact tables:

      • So far, we’ve determined that the more joins you have in the FT table script, the more complexity you add and the less chance to see a benefit when compared to the regular Batch FULL Load

      • So, you’ll need to consider WHETHER a simpler solution, which would propagate the changes from DELTA_STG_INVOICE_DTL as soon as they occur, coupled with a FULL re-processing of the data sent once a day, could be an alternative

  • There are very viable candidates out there for building an Incremental Load solution:

    • If you ONLY have to push new records into the Data Warehouse, WITHOUT the need to re-process past data, building an Incremental Load solution can provide significant advantages over a Batch FULL Load. This is especially true when there is a BIG disparity between the number of records that would be processed via the two approaches (a minimal append-only sketch follows this list)

    • Even if you need to propagate changes that also cover update and delete operations, as long as the driver of these changes is just one, or a very small number of, Staging tables, an Incremental Load can still be a very good solution

    • The only case when an Incremental Load stops being a viable solution is when the complexity of the scenarios that could trigger a load increases
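
For the first, append-only case mentioned above, the whole Incremental Load can stay very small. The sketch below assumes a LOAD_DATE watermark column and reuses the illustrative INVOICE_DTL_ID / AMOUNT columns; none of these names come from the actual model:

-- Append-only Incremental Load: push only the records newer than what the Fact table already holds
INSERT INTO FT_INVOICE (INVOICE_DTL_ID, ACCOUNT_ID, AMOUNT, LOAD_DATE)
SELECT S.INVOICE_DTL_ID, S.ACCOUNT_ID, S.AMOUNT, S.LOAD_DATE
  FROM STG_INVOICE_DTL S
 WHERE S.LOAD_DATE > (SELECT COALESCE(MAX(F.LOAD_DATE), DATE '1900-01-01')
                        FROM FT_INVOICE F);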

 

As a conclusion, Data Warehouse Incremental Load continues to remain a challenging topic, of which we’ve only scratched the surface. There is a reason why Real Time Data Warehousing based on Incremental Load is still perceived as the Holy Grail. “It’s not because it’s easy, but because it’s hard”. But there is definitely hope.

There are no simple or magical solutions to complex problems. It’s all about breaking down this complexity into smaller problems, solving them one by one, and, at the end, we’ll get there 😊

And now, back to you: what challenges did you experience with Data Warehouse Incremental Load? And how did you address them?

Tags:

Data Warehouse Best Practices

