Today we are embarking on a journey through the challenges of building a Data Warehouse Incremental Load solution. But let's take it step by step, and see where it goes.
General Context
A while back, in the article Real Time Data Warehouse - Is it really possible and do we really need it?, we discussed some of the challenges faced when moving towards a Real Time Data Warehouse architecture. This is still looked upon as the Holy Grail of Data Warehouse design: the ability to pick up changes in your Source Application and to immediately propagate them to the Data Warehouse system and to your reporting Application.
Before moving forward, let's clarify a bit why we would need Real Time Data Warehousing:
· Is it useful for real Business Intelligence decision making for a company's Top Management?
o On that I would typically answer with a BIG NO. You need to properly understand the data you are looking at, you need to discuss it, and make informed decisions. And whichever way you look at it, this is not a Real Time process.
o Having the latest invoice from your ERP system continuously flowing into your Data Warehouse shouldn't have a significant impact on the decisions your Top Management is taking.
· Or is it rather useful for Operational Activities?
o As we've discussed in the past, there are A LOT of operational activities performed on top of your Data Warehouse
§ You have people like Accountants & Cost Controllers who use the Data Warehouse as a place to consolidate data coming from very different systems, and as a place where business rules are added in order to enrich the data and bring it to a common denominator
§ These are mostly the people who are VERY interested in having the changes from the different applications pushed in real time to your Data Warehouse, as they have tight deadlines and use the Data Warehouse as the final part of a data validation cycle
§ If they don't like the data they end up with, they will go back to the Source Application, correct the data, and expect to see those changes as fast as possible in your Data Warehouse
Somehow, as a Data Warehouse developer, you need to build solutions that support this move towards a Near Real Time Data Warehouse, of course, with minimal costs 😊
Overall, to build your path towards Near Real Time Data Warehouse, you will have to address 2 different sets of challenges:
- Capturing the changes from the Source Application
- Propagating these changes at Data Warehouse level
And, of course, each one of them comes with its own difficulties.
It all starts with capturing the changes from your Source Applications
There are several methods of capturing those changes, assuming we are talking about relational database systems. And yes:
- there is a big world out there that goes well beyond relational, but let's keep things simple for starters.
- Not to mention that sometimes you don't even have direct access to the Source Application database, and instead you have to use API calls to extract the data you need. Which is ANYTHING but fast.
There are a few typical ways of capturing changes from a relational database system:
Using AUDIT fields like DATE_CREATION, DATE_MODIFICATION, ARCHIVE to detect the changes
- Not quite the best solution, as you're relying on the particular implementation the developers of your Source Application chose
- And the more different Source Applications you're using for the Data Warehouse, the more different approaches you have to deal with. And guess who will have to take all these differences into account when propagating the changes at Data Warehouse level 😊
- This approach works best where there are NO physical deletes applied at table level, with deleted records being marked by the application with something like ARCHIVE = 1
- However, real life experience has shown me that sooner or later someone from the Application Support team will have to run a DELETE script. And this is where the approach crumbles: as soon as a record is physically deleted, you will NOT have a simple way to detect and propagate this change at Data Warehouse level
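To make this concrete, below is a minimal sketch of a watermark-based extract relying on such audit fields. The ORDERS table, its columns, and the :last_load_ts bind variable are hypothetical names used for illustration only:

```sql
-- Extract only the rows changed since the last successful load.
-- :last_load_ts is the watermark saved by the previous run.
SELECT ORDER_ID, CUSTOMER_ID, AMOUNT, DATE_CREATION, DATE_MODIFICATION, ARCHIVE
FROM ORDERS
WHERE DATE_MODIFICATION > :last_load_ts;
-- Soft deletes (ARCHIVE = 1) arrive as regular updates.
-- Physically DELETEd rows never show up here -- that is the blind spot above.
```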
Using Custom Defined Database Triggers to detect the changes
- This type of solution involves defining a trigger for each table you intend to capture changes from, covering the standard INSERT/UPDATE/DELETE operations.
- Each time one of these operations occurs on the source table, the trigger performs an insert into a custom table holding the changes to be consumed
- HOWEVER, even when the Source Application database engine supports triggers (which is NOT always a given), it can result in significant slowdowns of the Source Application, as more and more data is being processed. Which is DEFINITELY NOT the way to go
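For illustration, here is a minimal sketch of such a trigger in SQL Server syntax; the ORDERS table and the ORDERS_CHANGES log table are hypothetical, and other engines would use a different trigger syntax:

```sql
-- Hypothetical change-log table: one row per captured operation.
CREATE TABLE ORDERS_CHANGES (
    CHANGE_ID  BIGINT IDENTITY(1,1) PRIMARY KEY,
    ORDER_ID   INT       NOT NULL,
    OPERATION  CHAR(1)   NOT NULL,   -- 'I' / 'U' / 'D'
    CHANGED_AT DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
);
GO
CREATE TRIGGER TRG_ORDERS_CAPTURE ON ORDERS
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- Rows only in "inserted" are INSERTs, rows only in "deleted" are
    -- DELETEs, and rows present in both are UPDATEs.
    INSERT INTO ORDERS_CHANGES (ORDER_ID, OPERATION)
    SELECT COALESCE(i.ORDER_ID, d.ORDER_ID),
           CASE WHEN d.ORDER_ID IS NULL THEN 'I'
                WHEN i.ORDER_ID IS NULL THEN 'D'
                ELSE 'U' END
    FROM inserted i
    FULL OUTER JOIN deleted d ON d.ORDER_ID = i.ORDER_ID;
END;
```

Note that the trigger fires synchronously, inside the source transaction, which is exactly where the slowdown mentioned above comes from.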
Using Change Data Capture (CDC) based on Database Log data
- This is the method preferred by many commercial solutions on the market to detect and consume changes
- Such solutions are either provided by the Database Vendor (e.g. Oracle GoldenGate) or by independent vendors like https://www.confluent.io or https://www.striim.com
- From a performance perspective these are the best solutions, as they rely on the Database Transaction Log to detect the changes, and therefore don't introduce the significant additional performance overhead that triggers do
- Using a Third-party Change Data Capture solution also has the advantage that it can support multiple database engines, which can definitely help if you are using different Source Applications from different vendors
o However, this can also be somewhat of a disadvantage, as these third-party solutions rely on database log formats that can be changed without notice by the database vendors, which can break them. Still, it is the responsibility of the Third-party Change Data Capture vendor to patch their solution to maintain compatibility with the latest Database Engine versions
o The BIG advantage of these third-party solutions is that they can provide a UNIFIED solution for consuming the changes from DIFFERENT database systems, which can be golden for Data Warehouse developers
- From a reliability perspective, the solutions offered by the Database vendors are the MOST reliable, as they will continue to adapt their CDC solutions as their database logging mechanisms evolve
o BUT, of course, going for dedicated CDC solutions from each database vendor will quickly become EXPENSIVE
o Not to mention that, in the end, you will still have to spend significant effort to integrate all these DIFFERENT solutions at Data Warehouse level
- Overall, this would be the solution I would go for, IF I had the budget for it
- Still, since nothing is perfect in this world, you also need to be aware that there are some pitfalls to this method:
o It will capture the changes that happened in the Source Database as soon as they occur, which may or may NOT be a great idea
o For example, an ERP system often runs processes which may DELETE / INSERT / UPDATE a row multiple times until the record reaches its final state
o If the Data Warehouse captured and applied each of these changes individually, it could result in a lot of work for mostly nothing. So you definitely need to be extra careful, when processing the captured changes, that it really makes sense to load them into the Data Warehouse
o What could make more sense is to connect the CDC for a set of tables with the availability of some LOG / TRIGGER tables on the Source Application side, which would tell you when it makes sense to PULL the changes on the transaction tables, WHILE ONLY retrieving the LATEST version of each record instead of the INTERMEDIARY ones (see the sketch right after this list)
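As an illustration of that last point, here is a minimal sketch of keeping only the latest captured version per business key. It assumes a hypothetical ORDERS_CDC table holding the raw captured changes with a monotonically increasing CHANGE_SEQ column; all names are made up for the example:

```sql
-- Collapse the captured change stream: keep only the LATEST version of
-- each record, discarding the INTERMEDIARY INSERT/UPDATE/DELETE versions.
WITH ranked AS (
    SELECT c.ORDER_ID, c.OPERATION, c.AMOUNT, c.CHANGE_SEQ,
           ROW_NUMBER() OVER (PARTITION BY c.ORDER_ID
                              ORDER BY c.CHANGE_SEQ DESC) AS rn
    FROM ORDERS_CDC c
)
SELECT ORDER_ID, OPERATION, AMOUNT
FROM ranked
WHERE rn = 1;   -- one row per key: its final state in this batch
```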
Using Delta Approach to detect changes at Table Level
- This approach involves comparing 2 versions of the data for the same table:
o The version of the data from the previous import -> PREV_STG_TBL
o The version of the data from the current import -> CRNT_STG_TBL
- Of course, when we are talking about importing a table this makes it anything but real-time, and it involves importing a LARGE part of, or even the ENTIRE, table from the Source Application
o Still, you should be able to use some calculated import filters to further slice the data you import
o For example, if you know that 99% of the changes to your data occurred in the last 3 months, you could apply this filter each time you import the data
o And just to be on the safe side, to cover the remaining 1% of changes falling outside the 3 months, you could import the data for the current Fiscal Year once a day and process those changes if they occurred
o Of course, all these assumptions should be validated as acceptable with your business
- However, one reality of Data Warehouse processing is that often it takes significantly more time and resources to PROCESS the data and to load it into a Fact table than to import it
o Importing data can be very light on the processing side
o While processing data will be HEAVY on CPU time, as loading scripts typically contain complex business logic to apply. The more transactions you process, the longer the loading scripts will take to run
o And here is where these DELTA tables really start to make a difference, as they allow you to process ONLY the transaction data that changed
- You can build a Generic Delta Engine mechanism that allows you to compare a Source and a Target table and to store the differences into a DELTA table (see the sketch right after this list). In the scope of this article, this will be the option we will be exploring the most
- This approach allows you to decouple from the Database Change Tracking features offered by the DIFFERENT database engines you may be importing data from.
o You can just as well import the data from CSV files
o As long as the imported data ends up in a Staging table, you are good to go for the Delta approach
- Thinking out loud, if I wanted to explore what can be done with an Incremental Load approach, I would start with this solution, as it is quite easy to develop and would allow testing different approaches for incrementally loading the changes into the Data Warehouse tables.
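To give a flavour of what such a Delta Engine could produce, here is a minimal sketch comparing the two staging snapshots with a FULL OUTER JOIN. It assumes a hypothetical single-column business key ORDER_ID, a single tracked attribute AMOUNT, and a DELTA_TBL target; a real Delta Engine would generate this comparison dynamically for any table:

```sql
-- Classify each business key as Inserted / Updated / Deleted by comparing
-- the previous snapshot (PREV_STG_TBL) with the current one (CRNT_STG_TBL).
INSERT INTO DELTA_TBL (ORDER_ID, OPERATION, AMOUNT)
SELECT COALESCE(c.ORDER_ID, p.ORDER_ID),
       CASE WHEN p.ORDER_ID IS NULL THEN 'I'   -- new in the current snapshot
            WHEN c.ORDER_ID IS NULL THEN 'D'   -- gone from the current snapshot
            ELSE 'U' END,                      -- present in both, values differ
       c.AMOUNT
FROM CRNT_STG_TBL c
FULL OUTER JOIN PREV_STG_TBL p ON p.ORDER_ID = c.ORDER_ID
WHERE p.ORDER_ID IS NULL
   OR c.ORDER_ID IS NULL
   OR c.AMOUNT <> p.AMOUNT;  -- real code needs NULL-safe comparisons here
```

The downstream loading scripts then only process the keys present in DELTA_TBL, which is what keeps the heavy Fact table logic proportional to the volume of changes rather than to the size of the full table.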
In the next article, Data Warehouse Incremental Load - Part 2, we will explore in more detail what this Delta Engine is all about.
And now, back to you: have you worked with Data Warehouse Incremental Load implementations so far? What types of approaches have worked best for you?
Cheers,
Nicolae