There is an old saying: "You can't unscramble scrambled eggs". Surprisingly, the same thing also applies to the Data Warehouse Atomic Level.
Before taking a deep dive into the subject, we should clarify what this Atomic Level stands for: it means storing in the fact table the finest level of granularity available in the source application, in order to allow full tracing from the Data Warehouse back to the originating data source.
I think this is the moment where we should take a step back and think about how the Data Warehouse appeared in the first place.
Usually, it all starts with a business request coming from one department, which asks for a report containing certain information. Slowly these report requests start to pile up, and they form the foundation for what will become the company Data Warehouse. From a technical point of view, two very different paths are available:
1. Design the fact tables to contain only the dimensions that the business user requested. Everything related to having the atomic level in the fact table is additional work, which in the end translates into additional costs that the users won't necessarily agree to pay, especially if the disadvantages aren't clearly stated or if there is huge business pressure to deliver the end result very fast. You typically end up in this scenario when the IT department starting the Data Warehouse doesn't have enough "weight" or experience to really influence the end solution, and the customer ends up dictating the technical solution. Which may come back to haunt you later on.
2. Design the fact tables with atomic level in mind right from the beginning, as a general design rule. It will be the IT team's job to push hard for this design feature with the Data Warehouse Business Owners, by explaining to them that this approach ensures the Data Warehouse will be able to answer not only the current business questions at the requested level of granularity, but in fact any business question related to that business subject, as long as the information is available in the source system. I'll call this the "collaborative" approach.
Now, let's think of the following business scenario: the Sales department asks for a report which contains the daily sales, per product, aggregated per Corporate Customer, since this is the level at which invoices are issued. You also have, at invoice line level, the individual Customer/Store to which you are delivering each product, but the business says that they will never need to follow sales at this level. Since they say never, you go ahead and design a fact table which contains the Customer Group as the lowest level of granularity, and you quickly deliver the requested report to the business user. And everyone lived happily ever after.
Well, not really. After some time, the business model changes, and now the Sales department says that they want to trace the Sales at Customer level, and they want the change applied not only to new records, but also to the existing historical data. And this is where the IT department really starts to scratch their heads since, once you've aggregated data and stored it like this in the fact table, you cannot go back to a finer level of detail, at least not without recomputing the entire fact table from the source, which might not always be desirable or even possible.
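To make the problem concrete, here is a minimal T-SQL sketch of such an aggregated load; the table and column names (FACT_SALES_DAILY, STG_INVOICE_LINE) are hypothetical, introduced only for illustration:

```sql
-- Hypothetical aggregated load: the individual CUSTOMER_KEY exists on the
-- staging invoice lines, but is deliberately dropped at load time.
INSERT INTO FACT_SALES_DAILY (DATE_KEY, PRODUCT_KEY, CUSTOMER_GROUP_KEY, AMOUNT, QUANTITY)
SELECT DATE_KEY, PRODUCT_KEY, CUSTOMER_GROUP_KEY, SUM(AMOUNT), SUM(QUANTITY)
FROM STG_INVOICE_LINE
GROUP BY DATE_KEY, PRODUCT_KEY, CUSTOMER_GROUP_KEY;

-- No query against FACT_SALES_DAILY can recover the per-Customer split:
-- SUM() is not reversible. The only way back is a full reload from the
-- source system, assuming it still holds the historical invoice lines.
```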
There are a few lessons to learn from this scenario:
● Don't let the business user dictate the level of detail in the fact table. It's normal for him to dictate what data he should have in the report, but it's up to the IT department to make sure that they always design fact tables with atomic level in mind. This approach ensures that you'll be able to roll up the data using dimension hierarchies at whatever level the user wants, while allowing him to go back to the finest level of detail when needed.
● Always bring the level of detail from the Source System. Let's think of the following example: you have a fact table which contains the sales per customer and product, at daily level. And this is fine, since this is the lowest grain for the dimensions the user wanted in his reports. He definitely doesn't want to see details like the Invoice No and Invoice Line in his report, as they make no sense in the decision making process.
And everything is fine until one day, when one of the Sales Managers says that the data for a given day is not correct, since it doesn't match the data from the source ERP system. And then everyone realizes that, even if the Invoice details are not useful in the decision making process, they are very useful in troubleshooting Data Quality issues, as the sketch below illustrates. The only little problem is that, once you've started aggregating data, you end up with that old saying: "You can't unscramble scrambled eggs". And this is why it's so important to have the Atomic Level of detail implemented right from the beginning.
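As an illustration, assuming the warehouse had kept the atomic INVOICE_NO/INVOICE_LINE_NO grain, a reconciliation query against the ERP extract could pinpoint exactly which invoices diverge (all table and column names below are hypothetical):

```sql
-- Hypothetical Data Quality check: compare the warehouse against the ERP
-- extract invoice by invoice, which is only possible because the fact
-- table kept the atomic INVOICE_NO / INVOICE_LINE_NO detail.
SELECT COALESCE(f.INVOICE_NO, s.INVOICE_NO) AS INVOICE_NO,
       SUM(ISNULL(f.AMOUNT, 0))             AS DW_AMOUNT,
       SUM(ISNULL(s.AMOUNT, 0))             AS ERP_AMOUNT
FROM FACT_SALES AS f
FULL OUTER JOIN STG_ERP_INVOICE_LINE AS s
       ON  s.INVOICE_NO      = f.INVOICE_NO
       AND s.INVOICE_LINE_NO = f.INVOICE_LINE_NO
GROUP BY COALESCE(f.INVOICE_NO, s.INVOICE_NO)
HAVING SUM(ISNULL(f.AMOUNT, 0)) <> SUM(ISNULL(s.AMOUNT, 0));
```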
Now that we've clarified the need for having the Atomic Level implemented at fact table level, we need to analyze the best technical solution for doing so:
1. We can create a Junk Dimension (ex. REF_INVOICE_DTL) which contains all the Invoice Details we are interested in exposing to the user (a DDL sketch follows this option's trade-offs below). But wait, what is a Junk Dimension? It's a dimension which contains a set of unrelated attributes, which typically don't form a hierarchy and are rarely used for reporting purposes, so it doesn't make sense to create an individual Dimension table for each one of them. The table will contain the following three main categories of fields:
i. Surrogate Key (ex. INVOICE_DTL_KEY)
ii. Primary Key (ex. INVOICE_NO, INVOICE_LINE_NO)
iii. Other fields (ex. BUSINESS_DEPT, GRANTED_DISCOUNT)
○ Advantages:
i. In order to minimize the size increase of the FACT table, only the Surrogate Key is added to it. This is very important since, in a row-store table, data is stored in rows organized into pages, and a page contains all fields. This means that, even when you need just one field, the whole page, containing all the columns, is read from disk, and only afterwards is the desired field selected. Based on this logic, you are VERY interested in keeping the number of fields in the fact table to a minimum.
ii. When using a standard ROLAP engine like Business Objects, the join with the Junk Dimension (REF_INVOICE_DTL) table is performed only when one of the Invoice Details is actually used in the report, and the join is made on an integer (the Surrogate Key), which is both very small and easily indexed.
iii. This is the best solution if your Database Engine has no support for ColumnStore indexes.
○ Disadvantages:
i. We end up with a very large REF table, potentially as large as the FACT table, since the Invoice Line is the most detailed information coming from the source system. Joining two large tables can degrade performance, even with an index on the Surrogate Key.
ii. The Surrogate Key column values (INVOICE_DTL_KEY) cannot be compressed, since we will have as many distinct values as there are rows in the FACT table. This typically means that an index on this column in the FACT table doesn't provide a dramatic performance improvement.
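Here is what option 1 could look like in T-SQL, as a minimal sketch built around the names used above; the column definitions are hypothetical:

```sql
-- Junk Dimension: Surrogate Key + Natural Key + the low-value attributes.
CREATE TABLE REF_INVOICE_DTL (
    INVOICE_DTL_KEY   INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- Surrogate Key
    INVOICE_NO        VARCHAR(20)   NOT NULL,                  -- Natural Key,
    INVOICE_LINE_NO   INT           NOT NULL,                  -- two columns
    BUSINESS_DEPT     VARCHAR(50)   NULL,
    GRANTED_DISCOUNT  DECIMAL(5,2)  NULL,
    CONSTRAINT UQ_REF_INVOICE_DTL UNIQUE (INVOICE_NO, INVOICE_LINE_NO)
);

-- The FACT table carries only the narrow integer Surrogate Key, so the
-- ROLAP engine joins to REF_INVOICE_DTL only when a report needs a detail.
CREATE TABLE FACT_SALES (
    DATE_KEY         INT           NOT NULL,
    PRODUCT_KEY      INT           NOT NULL,
    CUSTOMER_KEY     INT           NOT NULL,
    INVOICE_DTL_KEY  INT           NOT NULL
        REFERENCES REF_INVOICE_DTL (INVOICE_DTL_KEY),
    AMOUNT           DECIMAL(18,2) NOT NULL,
    QUANTITY         DECIMAL(18,3) NOT NULL
);
```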
2. The second option is a variation of the Junk Dimension approach, with the difference that, instead of adding the Surrogate Key to the FACT table, we add the columns which form the Natural Key directly (sketched after this option's trade-offs below).
○ Advantages:
i. As long as the columns which form the Natural Key are also fields used in user reports, no join is needed with the Junk Dimension table.
ii. Compared with the Surrogate Key approach, the columns which form the Natural Key are highly compressible, which also makes them perfect candidates for inclusion in a ColumnStore index. But wait, what's this ColumnStore index all about? For example, think how many distinct values INVOICE_LINE_NO can have, compared with the total number of rows in the FACT table. Since a ColumnStore index records the number of occurrences of each distinct value, the performance increase is dramatic, especially when applying functions like SUM/COUNT/AVG, due to the fact that physically less data is read from disk.
Additionally, another advantage of the ColumnStore index is that, since it uses vertical (column-oriented) storage instead of traditional row storage, as long as you select only columns included in this index, only those columns will be read from disk, which improves performance significantly compared to traditional row/horizontal storage. And these are some of the reasons why the ColumnStore index is considered the latest wunderkind in the Data Warehouse world.
○ Disadvantages:
i. When we need to retrieve data from the Junk Dimension, the join based on the Natural Key is significantly slower than the one based on the Surrogate Key, since it involves several fields with different data types. Additionally, the size of such an index will be significantly larger than the one generated in the Surrogate Key approach, which is a simple integer.
ii. We slowly move away from the traditional Fact table design, which states that such a table should contain only Foreign Keys pointing to the Dimension Surrogate Keys, plus numeric fields for Measures (ex. Amount, Quantity), since we need to add the columns which make up the Natural Key of the Junk Dimension.
iii. At least in the Microsoft SQL Server 2012 implementation of the ColumnStore index, once such an index is created, no INSERT/UPDATE/DELETE can be performed on the table. While the performance boost offered by such an index makes it ideal for Data Warehouse Reporting needs, it also makes the data much more difficult to update. Well, unless you're using Partition Switching to refresh the data, which becomes the best solution for refreshing data in tables with a ColumnStore index. More details about the partitioning mechanism can be found in the article series The Name of the Game - Table Partitioning.
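Below is a minimal sketch of option 2 in SQL Server 2012 syntax, with the same hypothetical names; the partition switch at the end is only indicative, since it assumes FACT_SALES is partitioned:

```sql
-- The Natural Key columns sit directly in the FACT table.
CREATE TABLE FACT_SALES (
    DATE_KEY         INT           NOT NULL,
    PRODUCT_KEY      INT           NOT NULL,
    CUSTOMER_KEY     INT           NOT NULL,
    INVOICE_NO       VARCHAR(20)   NOT NULL,  -- Natural Key of REF_INVOICE_DTL
    INVOICE_LINE_NO  INT           NOT NULL,
    AMOUNT           DECIMAL(18,2) NOT NULL,
    QUANTITY         DECIMAL(18,3) NOT NULL
);

-- SQL Server 2012: a nonclustered ColumnStore index covering the columns
-- used by reports; once it exists, the table becomes read-only.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FACT_SALES
    ON FACT_SALES (DATE_KEY, PRODUCT_KEY, CUSTOMER_KEY,
                   INVOICE_NO, INVOICE_LINE_NO, AMOUNT, QUANTITY);

-- Refresh pattern: load an identically structured staging table, build its
-- own ColumnStore index, then swap it in as a metadata-only operation, e.g.:
-- ALTER TABLE FACT_SALES_STAGE SWITCH TO FACT_SALES PARTITION 42;
-- (the partition number is illustrative)
```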
3. The third option is to proceed with the creation of a Degenerate Dimension, which basically involves adding all the fields we need, for which we don't have a separate dimension table, directly into the fact table (a DDL sketch follows this option's trade-offs below):
○ Advantages:
i. Best query performance, no matter which column is used in the reports, especially when creating a ColumnStore index, due to the high compressibility of the details. For example, in the case of the INVOICE_LINE_NO field, we obviously have far fewer distinct values than the total number of rows in the fact table.
ii. All the information we need is contained in the FACT table itself, which removes the need for a separate Junk Dimension table with the same number of rows as the FACT table.
○ Disadvantages:
i. The design is anything but pretty. We move far away from the traditional design based on Foreign Keys and numerical columns for measures, since we add dozens of columns of various data types. In the end we have something like an Abstract Painting, where every sense of order seems to have gone. Definitely not to everyone's liking.
ii. It's true that it might provide the best query performance, especially when using a ColumnStore index, but only a few Database Engines support vertical storage. If the Database Engine you use doesn't support the feature, it's far more efficient to use the approach based on the Surrogate Key.
iii. The same data refresh limitation for tables using a ColumnStore index very much still applies.
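And a minimal sketch of option 3, again with hypothetical column definitions: the invoice details become degenerate dimension columns of the FACT table itself:

```sql
-- Degenerate Dimension: no Junk Dimension table at all; every invoice
-- detail lives directly in the FACT table.
CREATE TABLE FACT_SALES (
    DATE_KEY          INT           NOT NULL,
    PRODUCT_KEY       INT           NOT NULL,
    CUSTOMER_KEY      INT           NOT NULL,
    INVOICE_NO        VARCHAR(20)   NOT NULL,  -- degenerate dimension columns
    INVOICE_LINE_NO   INT           NOT NULL,
    BUSINESS_DEPT     VARCHAR(50)   NULL,
    GRANTED_DISCOUNT  DECIMAL(5,2)  NULL,
    AMOUNT            DECIMAL(18,2) NOT NULL,
    QUANTITY          DECIMAL(18,3) NOT NULL
);
-- The repetitive, low-cardinality details compress very well inside a
-- ColumnStore index, which is where this design earns its query speed.
```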
At the end of this article, I would say that there is no perfect solution for implementing the Atomic Level of detail, each one having its own advantages and limitations. What's really important is to make sure that your Data Warehouse does have such a level of detail implemented, since it allows your users to take their investigation down to the finest granularity available from the source system. For the Technical Team, such an approach helps consolidate the Data Warehouse foundation, by imposing a standard which all other developments will follow. And everyone lived happily ever after. Really.
And now, back to you: does your Data Warehouse implement the Atomic Level of detail for the Fact tables? What challenges did you face during the implementation?