Now that we've cleared up the main concepts associated with Table Partitioning in Part 1 of this article, let's list the prerequisites for applying the Partitioning Mechanism:
· Partition Function - contains the boundary values that determine how the data is split into partitions
· Partition Scheme - created on top of the Partition Function, and used for creating partitioned tables; naturally, several tables can share the same Partition Scheme (a sketch of both definitions follows right after the example below)
· The table itself, created on the Partition Scheme:
CREATE TABLE SALES
( PartitionId int NOT NULL,
  Store varchar(50), -- always specify a length; a bare varchar defaults to varchar(1)
  MonthId datetime,
  Amount money
) ON PSCHEME_SALES (PartitionId)
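For reference, here is a minimal sketch of how the Partition Function and Partition Scheme behind PSCHEME_SALES might be defined. The function name PFUNC_SALES and the boundary values are assumptions for illustration only:

CREATE PARTITION FUNCTION PFUNC_SALES (int)
AS RANGE RIGHT FOR VALUES (201901, 201902, 201903); -- illustrative boundaries

CREATE PARTITION SCHEME PSCHEME_SALES
AS PARTITION PFUNC_SALES
ALL TO ([PRIMARY]); -- map every partition to PRIMARY for simplicity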
And now, let’s talk a bit about the logic implemented by the Partitioning Mechanism:
1. A partition is characterized by a PartitionNo, the partition number SQL Server uses internally to identify it, and by the partition's data
2. Every table, even one not explicitly partitioned, contains at least one partition (see the query sketch below)
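A quick way to inspect both is a query like the sketch below; sys.partitions is a standard catalog view, while PFUNC_SALES is the illustrative function name assumed above:

-- List the internal partition numbers and row counts for SALES
SELECT p.partition_number, p.rows
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID('SALES')
  AND p.index_id IN (0, 1); -- heap or clustered index only

-- Resolve the partition number for a specific PartitionId value
SELECT $PARTITION.PFUNC_SALES(201902) AS PartitionNo;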
3. When you insert data into a partitioned table, SQL Server knows how to distribute the rows to the appropriate partitions, as long as you have populated the PartitionId field. Otherwise your INSERT will fail.
4. Partitions can be used to instantly bulk-delete large volumes of data, without the row-by-row logging that a normal DELETE incurs. Even though there is no TRUNCATE TABLE PARTITION command (SQL Server 2016 eventually added TRUNCATE TABLE ... WITH (PARTITIONS), but the technique below works on any version), you can do the following:
a. Create an OLD table (e.g. SALES_OLD) with structure, constraints and indexes IDENTICAL to those of the table from which you want to empty a partition
i. The simplest way is to create it on the same Partition Scheme as the original table. The disadvantage is that this creates a lot of empty partitions, of which you'll likely use just one
ii. If you decide to build it on PRIMARY instead, you'll have to manually add a CHECK constraint enforcing PartitionId = @PartitionId. This is necessary because, if you remember, SQL Server needs to determine at plan-compilation time the range of values that partition can hold, without actually executing the query.
b. Identify the internal PartitionNo associated with the partitioning value that uniquely defines your partition (e.g. PartitionId); the $PARTITION function resolves it for you
c. Use the following statement to perform a partition switch:
i. ALTER TABLE SALES SWITCH PARTITION @PartitionNo TO SALES_OLD PARTITION @PartitionNo (note that the actual T-SQL keyword is simply SWITCH; "SWITCH OUT" and "SWITCH IN" only describe the direction of the move)
ii. When you do this, no data is physically moved; you merely repoint that partition's storage from SALES to SALES_OLD, which is why the operation is instant.
iii. You will end up with an empty @PartitionNo in the SALES table
iv. To physically remove the old data, you just have to run DROP TABLE SALES_OLD. The whole pattern is sketched below.
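Put together, the whole pattern looks roughly like this sketch (the table names, PFUNC_SALES and the value 201901 are illustrative assumptions):

-- a. clone the table on the same Partition Scheme
CREATE TABLE SALES_OLD
( PartitionId int NOT NULL,
  Store varchar(50),
  MonthId datetime,
  Amount money
) ON PSCHEME_SALES (PartitionId);

-- b. resolve the internal partition number for the data to purge
DECLARE @PartitionNo int = $PARTITION.PFUNC_SALES(201901);

-- c. the switch itself: metadata-only, instant regardless of row count
ALTER TABLE SALES SWITCH PARTITION @PartitionNo TO SALES_OLD PARTITION @PartitionNo;

-- d. physically remove the old data
DROP TABLE SALES_OLD;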
5. To minimize data concurrency issues, which usually occur when the ETL processes try to refresh a table while users are querying the same table via the Presentation Layer, the following approach can be used:
d. Create a TMP table (e.g. SALES_TMP) with structure, constraints and indexes IDENTICAL to those of the destination table
e. Perform all loading operations on the TMP table. Since users don't have access to this table, you gain a clean separation between the Data Refresh Flow and the Presentation Layer
f. If any errors occur during the load of the TMP table, you don't need to worry about a costly transaction ROLLBACK. Since the final table isn't touched in any way during the load, you just have to DROP the TMP table, and the final table keeps the latest available data
g. If no errors occur during the load of the TMP table, you can proceed with the SWITCH IN statement:
i. ALTER TABLE SALES_TMP SWITCH PARTITION @PartitionNo TO SALES PARTITION @PartitionNo (the source here is the TMP table: the freshly loaded partition is switched into SALES, whose corresponding partition must be empty)
h. In real-life scenarios, you'll couple the SWITCH OUT and SWITCH IN operations in a single transaction, together with a TRY/CATCH mechanism, so that if any error occurs you won't lose your data. For example:
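A minimal sketch of such a coupled switch, assuming SALES_OLD and SALES_TMP were built as described above and @PartitionNo has already been resolved (THROW requires SQL Server 2012+; use RAISERROR on older versions):

BEGIN TRY
    BEGIN TRANSACTION;

    -- switch the stale partition out of the final table
    ALTER TABLE SALES SWITCH PARTITION @PartitionNo TO SALES_OLD PARTITION @PartitionNo;

    -- switch the freshly loaded partition in from the staging table
    ALTER TABLE SALES_TMP SWITCH PARTITION @PartitionNo TO SALES PARTITION @PartitionNo;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION; -- the final table keeps its previous data
    THROW;
END CATCH;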
6. You should be aware that both SWITCH commands take a schema modification (Sch-M) lock on the tables involved. During that period, users are unable to access the final table. The advantage is that, since this is a metadata-only operation, both switches complete very fast, so users will hardly be impacted.
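If blocking readers is still a concern, on SQL Server 2014 and later the switch can be asked to yield instead of queueing behind long-running queries; a sketch:

ALTER TABLE SALES_TMP
    SWITCH PARTITION @PartitionNo TO SALES PARTITION @PartitionNo
    WITH (WAIT_AT_LOW_PRIORITY (MAX_DURATION = 1 MINUTES, ABORT_AFTER_WAIT = SELF)); -- cancel the switch rather than block readers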
7. Partitions can be both MERGED and SPLIT, which increases the flexibility of the solution while keeping the partition count in check. For example, you might want the data for the Current Year partitioned at month level, while the data for previous years is aggregated into a single partition per year. For this, you can build a SQL Server Agent job that runs once a year and performs all the necessary MERGE operations (sketched below).
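A sketch of both operations, using the illustrative PFUNC_SALES / PSCHEME_SALES names and boundary values assumed earlier:

-- MERGE removes a boundary, fusing the two adjacent partitions into one
ALTER PARTITION FUNCTION PFUNC_SALES() MERGE RANGE (201902);

-- SPLIT needs a filegroup marked NEXT USED on the scheme first
ALTER PARTITION SCHEME PSCHEME_SALES NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION PFUNC_SALES() SPLIT RANGE (202001);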
8. As you have seen so far, there are significant advantages to implementing Table Partitioning, but there are also drawbacks we should be aware of. Besides the ones we've already discussed, another significant one is the complexity of the operations involved.
To simplify the adoption of the mechanism at Data Warehouse level, the best approach is to build a set of generic Stored Procedures that encapsulate the logic above and isolate the programmer from the complexity of the operations, while adding very little extra code. Therefore, you should buckle up your seatbelt and proceed with the creation of the following generic components:
i. PRC_CLONE_TABLE - used to construct the table structure for both the TMP and OLD tables, with exactly the same column names, data types and partition scheme as the original table
i. At this stage, you might be thinking:
1. Why bother with such a procedure when you can create an identical copy using SELECT INTO ... WHERE 1 = 0?
2. Because the copy lands on the default filegroup, and none of the constraints or indexes will be copied, which is much less than identical (see the sketch below). And, you can trust me on this, the SWITCH commands are VERY, VERY picky about the table structure
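For illustration, here is the shortcut and where it falls short (a sketch):

-- Copies only column names and data types...
SELECT * INTO SALES_TMP FROM SALES WHERE 1 = 0;
-- ...but the copy lands on the default filegroup, with no partition
-- scheme, no constraints and no indexes, so a subsequent
-- ALTER TABLE ... SWITCH against the partitioned SALES table is rejected.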
ii. You will also have to address the VERY TRICKY problem of the Identity Column.
1. The combination of Table Partitioning, Partition Switching and an Identity Column is far from a match made in heaven.
2. You can enable the IDENTITY property on a column of a partitioned table, and it will work fine as long as you don't intend to perform any partition switch. When you do, you'll realize that, if you clone your TMP table the standard way, you'll end up with IDENTITY(1,1), which will generate ROW_ID values that already exist in the FINAL table. This isn't quite what you were hoping for.
3. If your Business Module procedure (e.g. PRC_SALES) is called sequentially for each partition, you can build the ROW_ID column in the TMP table using IDENTITY(@MAX_ROW_ID + 1, 1). And yes, you've guessed it, @MAX_ROW_ID is MAX(ROW_ID) from the final table.
4. But what do you do if you really need to call the procedure in parallel, let's say from SSIS, for each partition, in order to improve performance? Then, I'd say, you really have a problem, since the MAX(ROW_ID) method is completely unreliable at that point. It's not that the problem is unsolvable; it's that the actual cure might be more painful than the disease. But this is one of the challenges where I'll let you propose the solution.
5. Alternatively, you could modify your Business Module procedure to receive a list of partitions instead of just one.
6. The BIG advantage of this solution is that you let the SQL Server Engine handle the parallelism during execution, while you can still use the IDENTITY(@MAX_ROW_ID + 1, 1) approach to keep identity values consistent.
7. The BIG disadvantage is that you'll have to transform your code into Dynamic SQL in order to pass a list of scalar values for the partitioning field, which lets the SQL Server Engine pre-determine the partitions to query. This not only hurts code readability and complicates debugging, but can also slow down performance, since the execution plan for Dynamic SQL cannot be saved together with the stored procedure's plan and may therefore be generated each time the procedure is launched
i. But if you really, really need to call the procedure in parallel for different partitions, then you'll have a tough nut to crack. A sketch of the seeded-identity clone mentioned at point 3 follows below.
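For the sequential case, one way is to clone the TMP table with a pre-seeded identity, so new ROW_ID values continue after the final table's maximum. A sketch, assuming the final table has a ROW_ID identity column (all names here are illustrative):

DECLARE @MaxRowId bigint, @Sql nvarchar(max);
SELECT @MaxRowId = ISNULL(MAX(ROW_ID), 0) FROM SALES;

-- the seed must be injected via Dynamic SQL, since IDENTITY() in a
-- CREATE TABLE statement does not accept variables
SET @Sql = N'
CREATE TABLE SALES_TMP
( ROW_ID bigint IDENTITY(' + CAST(@MaxRowId + 1 AS nvarchar(20)) + N', 1) NOT NULL,
  PartitionId int NOT NULL,
  Store varchar(50),
  MonthId datetime,
  Amount money
) ON PSCHEME_SALES (PartitionId);';

EXEC sp_executesql @Sql;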
j. PRC_CLONE_CONSTRAINTS_INDEXES - surprisingly, does exactly what it says: it creates all the necessary constraints and indexes on the cloned table.
k. PRC_TRUNCATE_PARTITION - used to create the OLD table, with all necessary constraints and indexes, and to perform the SWITCH OUT operation
l. PRC_REFRESH_PARTITION - used to perform the SWITCH IN from the TMP table to the FINAL table
m. PRC_UPDATE_PARTITION_RANGE - assuming you already have a procedure that encapsulates the logic of a given Business Module (e.g. PRC_SALES) and receives the Partitioning Criteria (e.g. MonthId) as a parameter, you might want to refresh the SALES table for a given range of partitions, or for all of them.
i. Unless you intend to manually call your equivalent of the PRC_SALES procedure up to 1,000 times (the maximum number of partitions allowed in the default configuration), building such a procedure can prove quite handy.
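A minimal sketch of such a range-refresh wrapper, assuming PRC_SALES takes a MonthId parameter (the date handling is an assumption for illustration):

DECLARE @MonthId datetime = '20190101',
        @EndMonth datetime = '20191201';

WHILE @MonthId <= @EndMonth
BEGIN
    EXEC PRC_SALES @MonthId = @MonthId; -- refresh one partition per call
    SET @MonthId = DATEADD(MONTH, 1, @MonthId);
END;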
As I've said, there is a lot to like and dislike about Table Partitioning. One of the things I like most is that, in combination with a physical Data Mart layer, it offers the possibility of keeping a consistent set of data accessible to the users while refreshing the data behind it. Just imagine the following scenario:
· You have something like 10 FACT tables in the Data Mart, each refreshed by its own Stored Procedure
· You could modify the Partitioning Mechanism procedures to perform the SWITCH IN into the final tables not at the end of each Business Module Stored Procedure, but only once all the Stored Procedures have finished. This way, you commit all the changes in a single transaction and make sure the data stays consistent, along the lines of the sketch below:
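A sketch of the deferred, single-transaction publish (INVENTORY stands in for the other fact tables; all names are illustrative):

BEGIN TRANSACTION;
    ALTER TABLE SALES_TMP SWITCH PARTITION @PartitionNo TO SALES PARTITION @PartitionNo;
    ALTER TABLE INVENTORY_TMP SWITCH PARTITION @PartitionNo TO INVENTORY PARTITION @PartitionNo;
    -- ...one SWITCH per fact table, target partitions emptied beforehand...
COMMIT TRANSACTION; -- readers see all fact tables change in one consistent cut-over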
Another big advantage of table partitioning is improved Error Handling. Since all heavy processing is performed on a TMP table, to which users have no access, if you encounter any error during this phase you simply don't switch the partition from the TMP to the final table. This way, you always keep the latest valid data in the final table, without the use of costly transactions.
Is this cool or what?
And now, back to you. How much have you used table partitioning so far? How difficult do you think it would be to implement such a mechanism within your company's Data Warehouse? Is it worth the effort? And how would you approach the Identity Column problem during parallel execution?