The Real Time Data Warehouse is one of the concepts that has been bugging me for quite some time. Partially because I keep hearing it from various software vendors praising the performance of their Database, ETL or Presentation Layer tools. And partially because I'm really wondering if, in the real and complex IT infrastructure of a company, it is actually possible. And also whether it's really necessary.
Before going forward with this concept, I think it's very necessary to understand what real time and critical really mean. I'll take as an example the Electronic Stability Program (ESP), implemented as an active safety system in cars, where active means it helps you prevent a crash. The ESP uses sensors coupled to each wheel, to the steering wheel, and a gyroscope in order to determine whether the car is moving in the direction the driver intends. If it detects that the car is skidding out of control, it applies individual braking to each wheel and reduces engine power, in order to bring the car back to the trajectory the driver intended. In this way, the crash is prevented. And this is where you need real time decisions, because anything less would mean a crash, which could be a life and death scenario. And it works so well because it uses only a small amount of data, collected in real time, by components designed solely for this purpose.
Now the very real question is whether we really need such a real time response from a Data Warehouse, considering that it involves huge volumes of data. Are we really talking about life and death decisions, based on its data, every split second? My guess is that the answer is NO. If your business is all about split second decisions, then I'm afraid that Business Intelligence and Data Warehousing are really NOT the answer to such a problem. Business Intelligence is all about better understanding your business and taking informed business decisions, and this is never real time. Because real time means automation, and a human never takes business decisions in real time. Moreover, I strongly believe that a good decision always takes some time: time to think about it, to analyze the data coming from the BI system from different angles, to put it through the filter of one's personal experience, and only then take the final decision. And whichever way you look at this process, it is not real time.
More realistically, what we should expect from a Data Warehouse is Near Real time performance, meaning minimal latency for changes from the source systems to propagate through all Data Warehouse layers, up to the Presentation Layer.
First of all, when talking about a Near Real time Data Warehouse, we must have a good understanding of what it means and what it involves:
· From a business point of view, as a Business Owner, do you really need a Near Real time Data Warehouse in order to take business decisions? Is the business area in which the company operates so competitive that anything less than Near Real time is not good enough? Beware, I'm not talking here about what the Business Owners would like to have, but about what they really need to have. There is a BIG difference between the two approaches:
1. If it's just what the Business Owners would like to have, then the company can function without a problem even if the data is not refreshed in Near Real time. This is the case when you want a Formula 1 car, when a decent hatchback is enough for your needs. You might want to go for the F1 car but, after looking in your pocket, you'll be able to live with the hatchback as well.
2. If it's what the Business Owners need to have, that means that anything less than Near Real time can make the company underperform, and may even risk taking it out of business. And this is where you really need a Formula 1 car. The only problem is the small difference in price between a decent hatchback and a few hundred million dollars a year for a top of the line Formula 1 team. Therefore, you should be aware that there is a very big difference in price between a regular Batch Data Warehouse, refreshed once or a few times a day, and a Near Real time one. Because, for Near Real time, Performance is the name of the game. And Performance is always very expensive.
A Near Real Time Data Warehouse is the end result of an IT system finely tuned to get the best performance. The system is only as strong as its weakest link, and everything matters here:
IT system architecture:
1. Have the best IT professionals you can find to design the IT architecture for the best performance. You'll need these professionals every step of the way, from networking specialists to database administrators, in order to squeeze the last bit of performance out of the system. They are the equivalent of the mechanical engineers of an F1 team, so don't confuse them with the mechanics from your local car dealership. And that means they don't come cheap either.
2. Are all the Source Applications and the Data Warehouse server on the same high speed LAN segment, or are they connected via WAN? If they are connected via WAN, which is pretty normal for a company with an IT system distributed around the world, then performance is impacted by:
a. Network Bandwidth - typically acquired from an ISP. Large bandwidth is always expensive, and remember that there are a lot of applications sharing the same resources; the Data Warehouse and the Business Applications are just a part of them
b. Network delay - it's one thing to be connected via a LAN, for which the average network delay is under 1 ms, and quite a different one to be connected through a WAN, with a delay of 150 ms or more (see the quick calculation after this list)
3. If the WAN becomes a source of delay, then a major overhaul of the IT infrastructure might be required, in order to move some Application Servers used by the Data Warehouse onto the same LAN segment, without affecting the performance of the local users for whom the Applications were initially deployed (e.g. local ERPs)
4. Make sure you have enough CPU, RAM and Storage for both the Data Warehouse system and all Source Application servers. It really doesn't matter that you have a state of the art server for the Data Warehouse, with plenty of resources, if your main ERP server is already overloaded, since in this case the ERP server will become the bottleneck
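As a rough illustration of why these numbers matter (the figures are purely hypothetical): moving 1 GB of changed data over a 100 Mbps WAN link takes about 80 seconds for the transfer alone (1 GB = 8,000 Mb, and 8,000 Mb / 100 Mbps = 80 s), before any protocol overhead. And if the extraction protocol is chatty, making, say, 10,000 small round trips at 150 ms each, the network delay alone adds 10,000 x 0.15 s = 1,500 seconds, or 25 minutes. Numbers like these can break any Near Real time promise before a single Business Rule is even applied.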
Source Application architecture:
1. Support from the Source Applications for Incremental Load.
2. For critical Source Applications such as the main ERP, support from the Database Engine for Change Data Capture (CDC) is a must. In this way, you will be able to capture directly all the changes that occurred in the source system, at database level, since the last import you performed (see the sketch after this list). More details on the CDC topic can be found in the Back to basics - A Data Warehouse Dictionary article
3. You'll always have some applications for which incremental load will either not be possible or will be tougher to implement. For example, you may have incoming data from local ERPs in the form of text files. These are usually generated as a result of Scheduled Export Operations from each subsidiary. And Scheduled is not exactly a synonym for Near Real time, especially when that export is made just once a day. You should examine whether it's possible to replace the traditional text file export with CDC implemented on the local ERP database.
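For SQL Server, a minimal sketch of what enabling and reading CDC looks like is below; the dbo.Invoices table is hypothetical, used only for illustration:

```sql
-- Enable Change Data Capture at database level (requires sysadmin rights;
-- SQL Server Agent must be running, since CDC relies on capture and cleanup jobs).
EXEC sys.sp_cdc_enable_db;

-- Track changes on a hypothetical source table. Because the table has a
-- Primary Key, net changes can also be queried later on.
EXEC sys.sp_cdc_enable_table
    @source_schema        = N'dbo',
    @source_name          = N'Invoices',
    @role_name            = NULL,      -- NULL = no gating role for reading changes
    @supports_net_changes = 1;

-- At extraction time, read everything that changed since the last ETL run.
-- In a real system, @from_lsn would be the watermark saved by the previous run.
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Invoices');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

SELECT *    -- __$operation: 1 = delete, 2 = insert, 4 = update (after image)
FROM cdc.fn_cdc_get_all_changes_dbo_Invoices(@from_lsn, @to_lsn, N'all');
```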
Data Warehouse architecture:
- First of all, you should realize that, in practice, when talking about a Near Real Time Data Warehouse, you'll have it implemented just for those Source Applications which offer CDC support. For everything else, it will just be a Scheduled Data Warehouse refresh.
- The ETL tool should support the different CDC systems implemented in the various Source Applications. For a start, that means at least Microsoft SQL Server and Oracle CDC support.
- You should remember that data isn't just extracted from the Source Applications to the Staging Area; its content is really enriched, at Data Warehouse level, by applying a lot of Business Rules. And the focus here should be on improving the performance of the process which applies these Business Rules. Ideally, at Data Warehouse level, you have a Business Rule Engine in charge of this operation.
In practice, fine tuning the performance of this component can be very difficult, since such generic components are designed first of all to be flexible, in order to handle very different rules, and only secondarily for performance. Additionally, many times you don't have access to the code of such a component, so fine tuning its performance might be restricted to operations at Database level. More on this topic can be found in the Implementing a Business Rule Engine article.
- Make sure your Data Warehouse is designed vertically to handle CDC data. This involves having the right level of granularity, at transaction level, through all Data Warehouse layers. For each table you import, you must bring over all the fields which form the Primary Key in the source system, besides the other attributes you want. This level of detail should be kept through all Data Warehouse layers, and it's absolutely necessary in order to take advantage of CDC imports from the Source Applications (see the MERGE sketch after this list).
You should be aware that, as soon as you start to aggregate data inside your Data Warehouse, you lose the CDC performance advantage, since one aggregated record in the DW Layer Fact Table is formed from N records of the corresponding Work Layer table. And since, after aggregating the data, you have no way to determine which part of the total was affected by that one record received via CDC, you'll have to re-aggregate the whole data for the affected period (illustrated in the second sketch after this list). This does take time, and time is the one factor you don't have in abundance when talking about a Near Real Time Data Warehouse.
- The performance increase brought by a CDC oriented Data Warehouse can be very significant, since it allows processing only the data which changed in the Source Systems, instead of processing the data for an entire period (e.g. one day or one month).
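As a minimal sketch of why carrying the source Primary Key through all layers matters (table and column names are hypothetical), this is how net changes read from CDC could be applied to a Work Layer table:

```sql
-- Apply net changes to the Work Layer, matching on the Primary Key carried
-- over from the source system. #InvoiceChanges is assumed to hold the output
-- of cdc.fn_cdc_get_net_changes_dbo_Invoices(@from_lsn, @to_lsn, N'all'),
-- where __$operation is 1 = delete, 2 = insert, 4 = update.
MERGE work.Invoices AS tgt
USING #InvoiceChanges AS src
      ON tgt.InvoiceID = src.InvoiceID
WHEN MATCHED AND src.__$operation = 1 THEN
    DELETE
WHEN MATCHED AND src.__$operation = 4 THEN
    UPDATE SET tgt.Amount = src.Amount,
               tgt.Status = src.Status
WHEN NOT MATCHED AND src.__$operation = 2 THEN
    INSERT (InvoiceID, Amount, Status)
    VALUES (src.InvoiceID, src.Amount, src.Status);
```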
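And as a sketch of the re-aggregation penalty described above (again with hypothetical names): once a single CDC record touches a given day, the whole day has to be rebuilt in the aggregated Fact Table.

```sql
-- One changed source record invalidates the whole aggregate for its period,
-- so the entire day is deleted and rebuilt from the Work Layer detail.
DECLARE @AffectedDate date = '20110615';   -- derived from the incoming CDC rows

BEGIN TRANSACTION;

DELETE FROM dw.FactSalesDaily
WHERE SalesDate = @AffectedDate;

INSERT INTO dw.FactSalesDaily (SalesDate, ProductKey, Quantity, Amount)
SELECT SalesDate, ProductKey, SUM(Quantity), SUM(Amount)
FROM work.SalesTransactions        -- Work Layer, kept at source transaction granularity
WHERE SalesDate = @AffectedDate
GROUP BY SalesDate, ProductKey;

COMMIT TRANSACTION;
```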
Presentation Layer architecture:
1. As soon as you no longer rely on the Data Warehouse to perform the data aggregations, the pressure is really on the Presentation Layer to perform these aggregations on the fly. Based on our findings from the At the crossroads: MOLAP vs HOLAP vs ROLAP - Part 2 article, we know that the lowest latency is offered by ROLAP storage. Practically, the users are able to access the updated data as soon as it has been refreshed at Fact table level. As always, there is a catch: ROLAP cubes are famous for a lot of things, but performance isn't traditionally one of them. As long as you access data at the lowest level of granularity, the response really is real time but, as soon as you ask for aggregated data, you'll start to encounter performance problems.
2. Sure, if you have indexed views implemented, which can be generated automatically by Microsoft Analysis Services for ROLAP partitions, you'll benefit from the increased performance, but maintaining these indexed views does take some time (see the sketch after this list). Hopefully, column oriented storage engines like VertiPaq, which promise a big performance increase while aggregating data, are just around the corner, as part of SQL Server 2012 (code-named Denali).
3. Realistically, you should accept that there is always a tradeoff between Data Latency and Performance at the Presentation Layer. A good solution must segment the user population depending on their Data Latency needs:
a. The ones needing absolutely the lowest Data Latency will be served by a ROLAP cube, while being aware that they might encounter performance issues, at least until the underlying indexed views are refreshed. These cubes will typically be used by people like:
i. The Accounting department, who want their changes to be reflected ASAP in the Data Warehouse during the month closing operations
ii. The Marketing department, who want to measure, as close to real time as possible, the effectiveness of their marketing campaigns and to adjust them accordingly
b. The ones needing low Data Latency but who also need good query performance can be served by either a HOLAP cube or a MOLAP cube with Proactive Caching enabled. However, be aware that MOLAP cubes and very large, transaction-level dimensions (such as Invoice Details) are less than a match made in heaven for performance reasons, so restricting the number of dimensions available in the cube can improve performance a lot. For example, an Invoice dimension is less likely to be needed in a cube used mainly by Top Management.
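To make the indexed views point from above concrete, here is a minimal sketch of the kind of aggregation Analysis Services materializes for ROLAP partitions; the view, table and column names are hypothetical:

```sql
-- An indexed view: SCHEMABINDING plus a unique clustered index materializes
-- the aggregate on disk, and SQL Server then maintains it synchronously as the
-- Fact table changes, which is exactly where the maintenance cost comes from.
CREATE VIEW dw.vFactSalesByMonth
WITH SCHEMABINDING
AS
SELECT ProductKey,
       SalesMonth,
       SUM(Amount)  AS TotalAmount,   -- Amount assumed NOT NULL (a requirement for SUM here)
       COUNT_BIG(*) AS RowCnt         -- COUNT_BIG(*) is mandatory in an aggregated indexed view
FROM dw.FactSales
GROUP BY ProductKey, SalesMonth;
GO

CREATE UNIQUE CLUSTERED INDEX IX_vFactSalesByMonth
    ON dw.vFactSalesByMonth (ProductKey, SalesMonth);
```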
In the end, I would conclude that you should be very aware of which problems you are trying to solve by building a Near Real Time Data Warehouse solution:
· It's one thing to build such a system in order to support the very dynamic business environment in which your company operates, since without it, the company could lose its competitive edge.
· It's completely another thing to use the Data Warehouse solution to cover holes in the company organization. For example, if it's the last legal day on which the accounting month can be closed, and some of the local accountants are loading the data in the very last hour and find some mistakes which need to be corrected yesterday, is this really a problem which should be addressed by the Data Warehouse architecture, or an organizational problem?
And now, back to you: How close to Real time is your company's Data Warehouse? How does your Data Warehouse architecture handle incremental data loads from the Source Applications? What storage modes have you used for the Presentation Layer in order to minimize data latency, and what is the users' perception of it?