First things first: a pretty good overview of the problem that partitioning is meant to address can be found in the article To Have or Not to Have a Physical data mart. You should read it, since it will come in quite handy during this article.
As I’ve said in the previous article, one of the main challenges for a Data Warehouse with a ROLAP Presentation Layer on top is to be able to refresh the data while users are running their reports, while also maintaining data consistency.
Starting with SQL Server 2005, Microsoft introduced the Table Partitioning feature, on which we will be focusing during this article. Other database engines have implemented this functionality as well, so I'm pretty sure that the concepts laid out here are fairly generic.
So what's all the fuss about table partitioning? Is it a magical cure or a royal pain in the ... Well, as always, I’ll say it's a little bit of both.
Table partitioning is about logically dividing the data in a table into smaller chunks, which the SQL Server Engine can use to both refresh and query the data more efficiently. It relies heavily on metadata to determine, as early as the execution plan phase, which partitions contain the data that will satisfy the query. Then, during the execution phase, it will read only those partitions, instead of relying on reading the whole index or even performing a table scan in order to find the data.
Let's think of the following example:
· You have a 10 GB table, with a clustered index of 2 GB and data of 8 GB. When you search for something in the table, the SQL Query Engine will first seek the index in order to find the data that will satisfy your query, and then read the actual data. Any way you put it, this involves quite some I/O.
· Now, in the same example, let's imagine that the same table is partitioned on the MonthId field, giving you around 200 partitions of 50 MB each. Then, when you query the data on the MonthId field, the SQL Query Engine will determine, as early as the execution plan phase, which partitions contain the necessary data, and query only those (see the DDL sketch after this example). This will definitely be faster and generate less I/O than the first option.
· Now let's say that you'll be querying for a single MonthId.
§ If the data for that month sits roughly halfway through the table, then, in Scenario 1 (non-partitioned table), the SQL Query Engine will read roughly 1 GB of index data in order to reach the index entries that actually point to the data you're interested in, and then read the 50 MB of data (total I/O: 1.05 GB).
§ In Scenario 2 (table partitioned on MonthId), the SQL Query Engine will determine directly from the execution plan phase that there is just one partition containing the data I'm interested in. This means a total I/O of just 50 MB.
§ Pretty impressive, isn't it?
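To make the scenario above concrete, here is a minimal sketch of how such a table could be set up. All names (pfMonthId, psMonthId, SALES and its columns) are illustrative, not taken from a real schema:
-- Partition function: one boundary value per month; RANGE RIGHT places each
-- boundary value in the partition to its right.
CREATE PARTITION FUNCTION pfMonthId (INT)
AS RANGE RIGHT FOR VALUES (201101, 201102, 201103 /* ...one value per month */)
-- Partition scheme: maps every partition to a filegroup (ALL TO [PRIMARY]
-- keeps the sketch simple; real systems often spread partitions across files).
CREATE PARTITION SCHEME psMonthId
AS PARTITION pfMonthId ALL TO ([PRIMARY])
-- The fact table is created on the scheme, with MonthId as the partitioning column.
CREATE TABLE SALES
(
    MonthId INT   NOT NULL,
    Store   INT   NOT NULL,
    Amount  MONEY NOT NULL
) ON psMonthId (MonthId)
-- A query that filters on MonthId lets the engine eliminate all but one
-- partition at plan time:
SELECT SUM(Amount) FROM SALES WHERE MonthId = 201102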
- Known limitations - also known as the Royal Pain part:
- You cannot partition a table on more than one field. That's a bugger, since in real life it would have been interesting to partition the table on something like MonthId and Store. No one says that you cannot query the table on both fields, except that the Query Engine will scan all partitions containing those MonthId values, which hold, let's say, data for 100 stores, when you were only interested in the sales data for one store...
- You can't derive the partitioning criteria from a join or from a variable. That's the part where you'll start thinking of the f*$ word.
i. Just when you start thinking that there is a workaround for the first limitation, you hit the second one. For the single-field partitioning limitation, the following workaround can be applied:
1. Create a metadata table (e.g. PARTITION_META_SALES) which will hold the criteria used for partitioning: PartitionId (auto-increment); MonthId; Store
2. Based on the provided MonthId and Store, you select from the PARTITION_META_SALES table to recover the range of PartitionId values
3. Assuming that you've partitioned your SALES fact table on the PartitionId field, you write the following query:
SELECT * FROM SALES S
INNER JOIN PARTITION_META_SALES PMS ON S.PartitionId = PMS.PartitionId
WHERE PMS.MonthId = @MonthId AND PMS.Store = @Store
You would expect, when you look at the execution plan, to see that only the partitions containing the data matching the filters will be touched. Instead, you'll see that all partitions are used to solve the query.
ii. Why, oh, why?
1. Because SQL Server uses metadata to generate the execution plan, which doesn't involve actually executing the query. Looking at the above query, you'll realize that there is no way for the Query Engine to know your PartitionId range without running the query. SQL Server expects a scalar value for the partitioning criteria, and that's not negotiable.
iii. The FIX:
You can build a dynamic SQL statement which contains a scalar value or an explicitly typed list of values.
1. As a first step, you create a @PartitionId_List variable, where you'll put the list of PartitionId values recovered from PARTITION_META_SALES (a sketch of how to build it follows the statement below)
2. You build the following statement:
DECLARE @SQL VARCHAR(2000)
-- The IN list is now a string of literal values, so the engine can perform
-- partition elimination when the statement is compiled.
SET @SQL = 'SELECT * FROM SALES WHERE PartitionId IN (' + @PartitionId_List + ')'
EXEC (@SQL)
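For completeness, here is a minimal sketch of how @PartitionId_List could be built, assuming the PARTITION_META_SALES layout from step 1 above (the variable names and values are illustrative):
DECLARE @MonthId INT, @Store INT
SET @MonthId = 201102   -- illustrative values
SET @Store = 17
-- Concatenate the matching PartitionId values into a comma-separated list.
DECLARE @PartitionId_List VARCHAR(2000)
SELECT @PartitionId_List = COALESCE(@PartitionId_List + ',', '') + CAST(PartitionId AS VARCHAR(10))
FROM PARTITION_META_SALES
WHERE MonthId = @MonthId AND Store = @Store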
- Not the prettiest solution, but it does save the day. Well, kind of... since it cannot be used in every scenario:
i. You can use this solution during the ETL phase of the Data Warehouse, as long as you use stored procedures to load the data (see the sketch after this list). This will help you a lot in:
· Improving load performance
· Minimizing data consistency issues
· Error Handling
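A minimal sketch of what such a load procedure might look like, under the assumptions above (all object names are illustrative, and the actual reload step is elided):
CREATE PROCEDURE dbo.usp_Refresh_Sales
    @MonthId INT,
    @Store   INT
AS
BEGIN
    DECLARE @PartitionId_List VARCHAR(2000), @SQL VARCHAR(2000)
    -- Resolve the affected partitions from the metadata table.
    SELECT @PartitionId_List = COALESCE(@PartitionId_List + ',', '') + CAST(PartitionId AS VARCHAR(10))
    FROM PARTITION_META_SALES
    WHERE MonthId = @MonthId AND Store = @Store
    BEGIN TRY
        BEGIN TRANSACTION
        -- The literal IN list lets the engine touch only the relevant partitions.
        SET @SQL = 'DELETE FROM SALES WHERE PartitionId IN (' + @PartitionId_List + ')'
        EXEC (@SQL)
        -- ... reload the slice from your staging area here ...
        COMMIT TRANSACTION
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION
        -- Surface the error to the ETL framework.
        RAISERROR ('Refresh of SALES failed for the requested slice.', 16, 1)
    END CATCH
END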
- Not quite the magic fix you were hoping for on the Presentation Layer side
i. As long as you use any form of ROLAP or HOLAP engine, your SQL queries are generated as a result of user actions. Which kind of means that it's anywhere from very difficult to impossible to squeeze in the dynamic SQL part which uses the list of PartitionId values.
ii. You can, however, use the solution based on PartitionId logic to optimize the cube processing part, as long as you use a MOLAP engine. If you have a partitioned fact table, and a cube which is also partitioned in such a way that each cube partition corresponds to a table partition, the cube processing time can be greatly reduced.
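For instance, a minimal sketch of the source query you might bind to a single MOLAP cube partition so that it maps 1:1 to a table partition (the names and the PartitionId value are illustrative):
-- Processing this cube partition reads only the matching table partition.
SELECT MonthId, Store, Amount, PartitionId
FROM SALES
WHERE PartitionId = 42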
Regarding the number of partitions: by default, a partitioned table can contain up to 1,000 partitions. However, Microsoft received a lot of requests from companies running large Data Warehouse implementations on SQL Server 2008 R2, and issued an update (shipped with SQL Server 2008 SP2 and 2008 R2 SP1) which, once activated, increases the limit to 15,000 partitions per table.
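To my knowledge, the switch is activated per database via a dedicated system procedure (the database name below is illustrative):
-- Enable support for up to 15,000 partitions in the given database.
EXEC sp_db_increased_partitions @db_name = N'MyDataWarehouse', @increased_partitions = 'ON'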
As a conclusion for this first part of the article, we've learned some of the advantages and limitations of Table Partitioning. It might not suit every scenario, but there are quite a few scenarios where it makes a real difference. In the second part of the article we will talk more about what table partitioning actually involves.
Until the second part of the article reaches your shores: how much do you like or dislike using partitioning so far? How difficult was it to implement within your company's data warehouse? Have you gone for custom coding or for an out-of-the-box ETL solution with partitioning support?