One of the things that I've learned so far is that we all speak different dialects, even when we share the same language. One of the biggest challenges I've met so far was making sure that, when we talk about one concept, we are all talking about the same thing. And this happens because we are different people, we have different ways of learning things, and we have different experiences.
Try to launch a discussion between a Data Warehouse developer, a BI Business Analyst, a Project Manager and a BI Director and you will quickly see that, while talking about the same thing, such as the Data Warehouse architecture, each of the ones above will understand something different. Therefore, in order to make any kind of progress, you'll have to define a kind of dictionary that (very unlikely) everyone will agree on.
I have no intention of giving by-the-book definitions as accurate as Kimball's or Inmon's, who are the parents of the Data Warehouse revolution. Their work should be read and appreciated, since without it we would have nothing to discuss now. They had a vision and they fought for it, when everyone else was in the dark.
What I intend to do is give definitions more related to the business processes that generate them, more centered on the real-life implementation of these concepts that we see all around us.
So let's start with the basics:
1. ETL Tool (Extract, Transform and Load) - a software product which, in the Data Warehouse context, is used in the following way:
a. Imports and cleans the data from various heterogeneous application systems (the EXTRACT part of the ETL process):
i. Group and Local ERP applications
ii. Internal applications
iii. Web (ex. exchange rates, number of distinct users who connect to the company website)
iv. Other sources (ex. Txt, Xls, Xml files)
b. Converts the imported data to a unified group vision by using a defined set of business rules (the TRANSFORM part of the ETL process).
i. Data is inserted into the source application in order to satisfy a specific, and often localized, need. For example, even within the same ERP, if your company has subsidiaries in different countries, the accounting records entered into the system will have to correspond to the legal accounting system of each country
ii. However, in order to make a decision at group level, the data must be converted from the LOCAL vision to the GROUP vision, which will allow the decision makers to perform an "apples to apples" analysis, instead of an "apples to peaches" one
c. Aggregates the data and prepares it for the Decision Making Process, by loading it into the final stages of the DWH (the LOAD part of the ETL process)
· Users don't connect directly to the Data Warehouse; they use a Presentation Layer instead, which isolates them from the complexity of the DWH system
Examples of ETL products: Microsoft SQL Server Integration Services (SSIS); BO Data Integrator; Informatica; and many, many others. Also, let's not forget that in real life you can have a hybrid solution, where you combine the ETL product with stored procedures, to obtain the maximum degree of flexibility.
If it were up to me to put an animal behind it, I would say that, in the Data Warehouse stable, it's the workhorse. In the end, somebody still has to pull the carriage.
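To make the three steps a bit more concrete, here is a minimal SQL sketch of such an ETL pass. Take it as an illustration only: the table and column names (SRC_INVOICES, SA_IMPORT_INVOICES, SA_WORK_INVOICES, MAP_ACCOUNTS, FT_ACCOUNTING) and the currency conversion rule are my own assumptions, not something your ETL tool will generate for you.

    -- EXTRACT: copy the raw rows from the source system into the Import Area,
    -- with no transformation (the table mirrors the source structure)
    INSERT INTO SA_IMPORT_INVOICES (INVOICE_ID, INVOICE_DATE, LOCAL_ACCOUNT, LOCAL_CURRENCY, AMOUNT)
    SELECT INVOICE_ID, INVOICE_DATE, ACCOUNT_CODE, CURRENCY_CODE, AMOUNT
    FROM SRC_INVOICES;                              -- hypothetical source table

    -- TRANSFORM: apply the group business rules in the Work Area
    -- (here: map the local chart of accounts to the group one and convert to EUR)
    INSERT INTO SA_WORK_INVOICES (INVOICE_ID, INVOICE_DATE, GROUP_ACCOUNT, AMOUNT_EUR)
    SELECT i.INVOICE_ID,
           i.INVOICE_DATE,
           m.GROUP_ACCOUNT,                         -- LOCAL -> GROUP account mapping
           i.AMOUNT * r.RATE_TO_EUR                 -- LOCAL currency -> GROUP currency
    FROM SA_IMPORT_INVOICES i
    JOIN MAP_ACCOUNTS m ON m.LOCAL_ACCOUNT = i.LOCAL_ACCOUNT
    JOIN EXCHANGE_RATES r ON r.CURRENCY_CODE = i.LOCAL_CURRENCY
                         AND r.RATE_DATE = i.INVOICE_DATE;

    -- LOAD: aggregate and push the prepared data into the final stage of the DWH
    INSERT INTO FT_ACCOUNTING (TIME_KEY, ACCOUNT_KEY, AMOUNT)
    SELECT t.TIME_KEY, a.ACCOUNT_KEY, SUM(w.AMOUNT_EUR)
    FROM SA_WORK_INVOICES w
    JOIN REF_TIME t ON t.CALENDAR_DATE = w.INVOICE_DATE
    JOIN REF_ACCOUNT a ON a.GROUP_ACCOUNT = w.GROUP_ACCOUNT
    GROUP BY t.TIME_KEY, a.ACCOUNT_KEY;

In real life each of these steps is split across many packages and wrapped in error handling, but the Extract / Transform / Load skeleton stays the same.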
2. Data Warehouse Layers:
a. Staging Area (SA) - used for short-term storage of data inside the DWH:
i. Import Area (IA) - this is where the data imported from the various source systems is stored. The data should be imported with minimal or no transformations. At this level, the table structure mirrors the data structure of the source systems.
ii. Work Area (WA) - this is where the data from the Import Area is cleaned up, and where the DWH-specific business rules are applied, in order to convert the data from the LOCAL vision of the Source Systems to the GROUP vision specific to the company DWH. At this level, data is highly de-normalized.
b. Actual Data Warehouse (ADW) - used for long-term storage:
i. The data stored here is highly normalized
ii. Contains only:
1. REFERENTIAL (REF) - used to store the attributes which characterize a certain business concept, which will later be used as one of the axes of analysis. For example, the REF_TIME table contains all the attributes that will later be used to define a time dimension
The REF tables are characterized by the presence of:
a. A NATURAL KEY, usually inherited from the Source System, which does have business meaning outside of the DWH system. This is the field on which the PRIMARY KEY constraint is usually set.
b. A SURROGATE KEY, generated exclusively at DWH level, usually an auto-increment integer. This field is only used to speed up the joins inside the DWH and SHOULD NOT be used outside the DWH.
2. FACT (FT) - contains the transactions associated with a very specific business module (ex. FT_SALES). It only contains:
a. SURROGATE KEYs for each distinct leaf level of the transaction on which an analysis must be performed (ex. PRODUCT_KEY, CUSTOMER_KEY, TIME_KEY)
b. MEASURES - usually numeric fields which contain the quantifiable elements of a transaction (ex. AMOUNT, QUANTITY)
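To illustrate the REF / FACT split, here is a minimal table sketch in SQL Server flavored syntax; the names REF_PRODUCT and FT_SALES follow the examples above, but the columns are assumptions I've picked for the illustration, not a prescribed model.

    -- REF table: one row per product, with the natural key inherited from the
    -- source system and a surrogate key generated exclusively at DWH level
    CREATE TABLE REF_PRODUCT (
        PRODUCT_KEY   INT IDENTITY(1,1) NOT NULL,  -- SURROGATE KEY, DWH-internal only
        PRODUCT_CODE  VARCHAR(30) NOT NULL,        -- NATURAL KEY from the source ERP
        PRODUCT_NAME  VARCHAR(100),
        BRAND_KEY     INT,                         -- links to other REF tables (the ADW is normalized)
        CATEGORY_KEY  INT,
        CONSTRAINT PK_REF_PRODUCT PRIMARY KEY (PRODUCT_CODE)
    );

    -- FACT table: only surrogate keys for the axes of analysis, plus the measures
    CREATE TABLE FT_SALES (
        TIME_KEY     INT NOT NULL,                 -- -> REF_TIME
        PRODUCT_KEY  INT NOT NULL,                 -- -> REF_PRODUCT
        CUSTOMER_KEY INT NOT NULL,                 -- -> REF_CUSTOMER
        AMOUNT       DECIMAL(18,2),                -- MEASURE
        QUANTITY     DECIMAL(18,2)                 -- MEASURE
    );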
3. Data Mart (DM) - the top layer of the Data Warehouse, centred around the reporting needs of a specific department. From a design point of view, we can have:
a. A distinct PHYSICAL layer, implemented as a database or a collection of tables, separated from the other Data Marts by mnemonic or database schema. Due to the implementation of the PHYSICAL layer, the data here is DENORMALIZED and fully prepared for the querying process performed by the Presentation Layer. This results in a typical STAR schema implementation
A STAR schema is a DWH design approach where all dimension tables are linked directly to a fact table, with no direct links between dimension tables. Contrary to some of the folklore around, it's still a STAR design even if you have more than one fact table involved, as long as you don't have direct links between dimensions.
· The GOOD: Better performance; Simpler design for the Presentation Layer; Totally isolates the Presentation Layer from the data load processes occurring in the other layers of the DWH; Work can be more easily parallelized; Humans like simple designs :)
· The BAD: By creating a distinct physical layer, it partially duplicates the data from the ADW layer; It complicates the distribution of updated data from the ADW to the Presentation Layer by adding a supplementary physical layer which must be updated
· The UGLY: It just looks like a Renaissance masterpiece, so it's simply beautiful. Except that, for complex business requirements, you can spend ages on the detailed analysis phase and on implementing it. It can be like in "The Agony and the Ecstasy" novel.
In order to implement a clean STAR design, the data coming from the ADW is de-normalized by reducing the number of tables, based on the following logic (a small SQL sketch of this follows right after the STAR / SNOWFLAKE discussion below):
1. Create a single dimension table (DIM) for each Axis of Analysis. Such a DIM table is usually based on multiple REF tables from the ADW layer
2. Create one or a few Final Fact tables (OLAP) for each Data Mart, based on the FACT tables of the corresponding ADW business modules. A departmental Data Mart usually consumes data from different business modules.
b. A distinct LOGICAL layer, implemented as a component of the Semantic Layer within the Presentation Layer, which includes all the relations between the involved REFERENTIAL and FACT tables from the Actual Data Warehouse. Due to the lack of a PHYSICAL layer, this usually results in a SNOWFLAKE schema, generated by the high normalization of the ADW layer.
A SNOWFLAKE schema is a DWH design approach where we have relationships between REF tables (most likely REF tables from the ADW), besides the links with the Final Fact tables (most likely FACT tables from the ADW). By following this approach, we will have REF tables which never join directly with the Fact tables
· The GOOD: Doesn't cause any data duplication; Faster to implement than a STAR design; Project Managers love it, because it can be implemented very fast
· The BAD: Performance, since a lot of joins have to be performed; There is a high cost related to maintaining such a design through future evolutions
· The UGLY: A ROLAP design over a huge Snowflake involving hundreds of tables; Normal humans aren't able to mentally model such a level of complexity; It's like a piece of an abstract painting, and it makes you feel really stupid for not being able to understand it. Except it's not a piece of abstract art, it's programming, and programmers should be able to understand it.
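Coming back to the STAR de-normalization logic described earlier, here is a small, hypothetical sketch of how a DIM table can be flattened out of several normalized REF tables, and of how a typical STAR query then looks; REF_BRAND, REF_CATEGORY, DIM_PRODUCT and DIM_TIME are names I've assumed for the example.

    -- Build one denormalized dimension from several normalized ADW REF tables
    INSERT INTO DIM_PRODUCT (PRODUCT_KEY, PRODUCT_NAME, BRAND_NAME, CATEGORY_NAME)
    SELECT p.PRODUCT_KEY,
           p.PRODUCT_NAME,
           b.BRAND_NAME,                            -- flattened from REF_BRAND
           c.CATEGORY_NAME                          -- flattened from REF_CATEGORY
    FROM REF_PRODUCT p
    JOIN REF_BRAND b ON b.BRAND_KEY = p.BRAND_KEY
    JOIN REF_CATEGORY c ON c.CATEGORY_KEY = p.CATEGORY_KEY;

    -- A typical STAR query: the fact table joins directly to each dimension,
    -- and the dimensions never join between themselves
    SELECT d.CATEGORY_NAME, t.CALENDAR_YEAR, SUM(f.AMOUNT) AS TOTAL_AMOUNT
    FROM FT_SALES f
    JOIN DIM_PRODUCT d ON d.PRODUCT_KEY = f.PRODUCT_KEY
    JOIN DIM_TIME t ON t.TIME_KEY = f.TIME_KEY
    GROUP BY d.CATEGORY_NAME, t.CALENDAR_YEAR;

In the SNOWFLAKE variant, the second query would instead go through REF_PRODUCT, REF_BRAND and REF_CATEGORY individually, which is exactly where the extra joins (and the performance bill) come from.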
Within your company, you most likely already have something like an Accounting Data Mart and a Sales Data Mart.
4. Data Warehouse (or DWH):
a. A place where the information gathered from the company's various applications is centralized and prepared to be exploited within the decision making process
b. A collection of DWH layers. I think one of the best definitions I've heard of this comes from the Shrek movie: "Ogres are like onions. They stink? NO, they have layers". The best conclusion I can draw from this is that a Data Warehouse is like an ogre: scary at first, but a great friend once you get to know him.
If the ETL is the workhorse, then the DWH is definitely the carriage. Normally, they should work in tandem, which means you should always choose your horse carefully, in order to be able to pull not only what you have in the carriage right now, but also what you think you'll have in the future. It's called scalability, or not getting left with the carriage in the middle of the road. It depends on what you want to call it.
5. Presentation Layer - the place where the users connect and perform their reporting. Also known as Online Analytical Processing (OLAP), it is a set of tools to which users connect in order to interactively analyze multidimensional data from multiple perspectives (Wikipedia). The cube is the basic unit used for multi-dimensional analysis. There are 2 major advantages of this technology compared with traditional Relational Reporting:
a. Ad-hoc reporting - users can construct their own reports, on the fly, without IT assistance. This approach eliminates the gap caused by the fact that the IT department is usually the bottleneck, since you have many more users than programmers in the IT department. It also makes:
i. the users happier by knowing that they can do the work by themselves
ii. the programmers happier, since they can focus on programming new stuff, rather than just concentrating on creating reports
b. Fast-responding queries due to data aggregation. Compared with relational reporting, where data is aggregated on the fly while users are running their reports, in an OLAP cube some of the data is aggregated when the cube is processed, which definitely generates a performance boost.
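To give a rough, relational-only illustration of why pre-aggregation helps (real OLAP engines manage their aggregations internally, so the AGG_SALES_MONTH table below is just an assumption made for the example):

    -- Relational reporting: the aggregation happens on the fly, at query time,
    -- scanning the (potentially huge) fact table for every single report run
    SELECT PRODUCT_KEY, SUM(AMOUNT) AS TOTAL_AMOUNT
    FROM FT_SALES
    GROUP BY PRODUCT_KEY;

    -- OLAP-style pre-aggregation: the summary is computed once, at cube process time...
    INSERT INTO AGG_SALES_MONTH (MONTH_KEY, PRODUCT_KEY, TOTAL_AMOUNT)
    SELECT t.MONTH_KEY, f.PRODUCT_KEY, SUM(f.AMOUNT)
    FROM FT_SALES f
    JOIN DIM_TIME t ON t.TIME_KEY = f.TIME_KEY
    GROUP BY t.MONTH_KEY, f.PRODUCT_KEY;

    -- ...so the user's query only has to read the small, already-aggregated table
    SELECT PRODUCT_KEY, SUM(TOTAL_AMOUNT) AS TOTAL_AMOUNT
    FROM AGG_SALES_MONTH
    GROUP BY PRODUCT_KEY;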
Normally, the Presentation Layer is a pretty place, where everything is very intuitive and the users can get any information they can think of. It's like Aladdin's lamp, where the genie accomplishes everything faster than lightning. Well, at least when what you want is already in the DWH carriage. Since we were previously talking about layers, we also have them here:
· Frontend Application - the component that the users directly interact with, and which hides them from the complexity of the calculations performed behind it. It actually transforms the user's actions into generated code, ready to be run on the Backend server. In the example above, I'd say it's the shiny lamp
i. Examples: Microsoft Excel; Business Objects Infoview
· Backend Application - the component where the generated code obtained above is processed, and the corresponding data is retrieved. In the example above, I'd say it's the genie.
ii. Examples: Microsoft Analysis Services; Business Objects; Oracle Hyperion Essbase
Currently, there are 3 major types of storage for an OLAP cube:
- MOLAP - Multidimensional OLAP - where both the data and the aggregations are stored in a multi-dimensional data store, completely separated from the relational layer below.
· The GOOD - Once a cube using this type of storage has been processed, we can even drop the associated relational tables, with no impact on the cube's data; Data is compressed, which results in lower I/O than running a corresponding report on the relational database
· The BAD - Processing the cube introduces a latency between the moment when the data has been modified in the relational table and the moment when the users can see the updated data
- ROLAP - Relational OLAP - where both the data and the aggregations are held in a relational database
· The GOOD - very low latency between the moment when the data has been modified in the relational table and the moment when the users can see the updated data
· The BAD - low performance since, each time the result is not found in the aggregation tables, the engine has to fall back to running a SELECT ... SUM() ... GROUP BY against the base tables
- HOLAP - Hybrid OLAP - where the data is held in the relational database and the aggregations in a multi-dimensional database. Designed as a compromise between MOLAP and ROLAP
In reality, the experience with the Presentation Layer is like going on a first date. You never know what you're going to get:
§ You hope for a pretty girl, whom you'll fall in love with, she'll love you back, and you'll live happily ever after
§ In reality, you can end up in one of the following scenarios:
- You can get a pretty girl, but she's not as smart as you were hoping for
- You get the girl, she's not the most beautiful, but she's really smart - you may have something cooking here. It's not love at first sight, but it’s an excellent starting point
- You get the girl, but she's neither smart nor beautiful - then you've really made a bad choice
- You're not smart enough for the girl - that hurts, and you'll never admit it
6. Data Mining - a set of mathematical algorithms used to determine hidden associations between data. One example of Data Mining usage is determining consumer buying patterns. I think we all know the classic example: a supermarket performs an analysis of its sales and discovers that on Wednesday evenings, which are regular football days, the items most bought by male consumers are beer, peanuts and tampons. In order to increase sales, they decide to place those 3 items together in the shop, in order to be closer to the consumer buying pattern.
There is a crucial difference between the analysis performed on an OLAP cube and Data Mining analysis. You use cube analysis when you know what you're looking for (ex. the sales of a certain shop in a given week). You use Data Mining when you don't know what you're looking for. Therefore, due to the different nature of the algorithms involved, different conclusions can be reached on exactly the same set of data, and so the Data Mining results are much more open to interpretation. There is an old Russian children's story whose name, I would say, relates to what Data Mining does: "Go there, I don't know where, bring me that, I don't know what"
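As a very rough illustration of the "hidden associations" idea (real Data Mining tools use far more sophisticated algorithms than this), a naive market-basket query could simply count how often two items show up in the same shopping basket; SALES_LINES and its columns are assumptions I've made for the example.

    -- Naive co-occurrence count: pairs of items bought in the same basket,
    -- most frequent pairs first
    SELECT a.ITEM_NAME AS ITEM_1,
           b.ITEM_NAME AS ITEM_2,
           COUNT(*)    AS BASKETS_TOGETHER
    FROM SALES_LINES a
    JOIN SALES_LINES b ON b.BASKET_ID = a.BASKET_ID
                      AND b.ITEM_NAME > a.ITEM_NAME  -- avoid self-pairs and duplicates
    GROUP BY a.ITEM_NAME, b.ITEM_NAME
    ORDER BY BASKETS_TOGETHER DESC;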
7. Change Data Capture (CDC) - an option which can be activated at database level, for a list of selected objects, and which allows recovering all the INSERT/UPDATE/DELETE operations performed on those tables. It's based on a complex publisher/subscriber mechanism.
At database level, it solves one of the most complex challenges for DWH developers, which was to rapidly identify, in the source systems, the fields which allowed tracking the changed data. This was a real pain to implement, especially when trying to handle the scenario where a DELETE has been performed on a table. Covering such a scenario for the Incremental Load would have involved joins between the DWH SA table and the one from the Source Application, which were such a disaster from a performance point of view that you were really better off without them.
It is ideally suited for DWH Incremental Load processes, when you want to process just the data which changed in the Source System. The largest database vendors, like Oracle and Microsoft, already support this option. However, taking full advantage of this feature at all DWH layers depends on the implemented DWH design.
It has long been considered the Holy Grail of DWH since, properly implemented, it can lead to Near Real-time OLAP, which would mean minimal latency between the moment when the data changed in the source system and the moment when the data is accessible to the users in the DWH Presentation Layer.
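For example, on Microsoft SQL Server (the dbo.Invoices table is a hypothetical one, and the exact syntax differs on other platforms), CDC is switched on per database and per table, and the captured changes can then be read during the Incremental Load:

    -- Enable CDC at database level, then for one source table
    EXEC sys.sp_cdc_enable_db;
    EXEC sys.sp_cdc_enable_table
         @source_schema = N'dbo',
         @source_name   = N'Invoices',             -- hypothetical source table
         @role_name     = NULL;                    -- no gating role for this example

    -- Read all changes captured between two log sequence numbers (LSNs);
    -- the __$operation column tells you what happened: 1 = DELETE, 2 = INSERT, 3/4 = UPDATE
    DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_Invoices');
    DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

    SELECT *
    FROM cdc.fn_cdc_get_all_changes_dbo_Invoices(@from_lsn, @to_lsn, N'all');

Notice that even the dreaded DELETEs come back as regular rows, which is exactly what makes the Incremental Load scenario described above so much easier.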
And now, back to you. How much do you agree or disagree with the above definitions? What would you add to this dictionary?