Since starting this series of articles dedicated to Data Warehouse Best Practices, we've talked a lot about reusable components such as a Partitioning Mechanism, a Business Rule Engine or a Generic Import Mechanism. And since the most reusable component of all is the code itself, I thought it would be a good idea to prepare an article dedicated to the Coding Guidelines to be used for Data Warehouse solutions based on Microsoft SQL Server.
We will start by discussing SQL Server development in Part 1 of the article, continue in Part 2 with SSIS development and Source Control, and end with the Development Strategy for a Business Intelligence department.
SQL SERVER DEVELOPMENT
First things first. There are quite a few SQL Server Coding Guidelines out there, and one of the best I’ve come across is the SQLServerGuidelines written by Pinal Dave. I highly recommend reading it, since it contains a lot of valuable information for any developer.
Besides these excellent guidelines, which are applicable to any SQL Server development, I’ve tried to add some of my own, based on my experience so far and more specific to Data Warehouse development based on Microsoft SQL Server.
1. Normalize the database structure to the third normal form (3NF), at least for the Actual Data Warehouse database. This approach allows greater flexibility when data needs to be regrouped at Data Mart level. At Data Mart level, the key to obtaining better performance is de-normalization, since it greatly reduces the number of joins to be performed, especially when using a ROLAP Presentation Layer.
2. Avoid using SELECT * in queries, including in INSERT INTO ... SELECT statements. Instead, use specific column names in the SELECT statement. In this way we avoid any unintended impact caused by adding a new column to the source table.
3. Avoid using an INSERT statement without a column list, since this could cause significant problems when new columns are added or the order of the existing columns is changed in the table.
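A minimal illustration of guidelines 2 and 3, using hypothetical STG_SALES / FACT_SALES tables: both the INSERT and the SELECT list their columns explicitly, so adding a new column to either table cannot silently misalign or break the load.

INSERT INTO dbo.FACT_SALES (INVOICE_NO, INVOICE_DATE, STORE_ID, AMOUNT)
SELECT S.INVOICE_NO,
       S.INVOICE_DATE,
       S.STORE_ID,
       S.AMOUNT
FROM dbo.STG_SALES S;   -- explicit column lists on both sides, no SELECT *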
4. Find the appropriate balance between temporary tables, derived tables and Common Table Expressions (CTEs):
a. When possible, replace the use of temporary and derived tables (sub-queries) with Common Table Expressions (CTEs), since they generate fewer I/O operations. However, the use of a CTE is limited by the following:
i. It can be referenced only by the SQL statement that immediately follows it.
1. Theoretically, we could use a table variable in order to hold the result set of a CTE for later use.
ii. The fact that CTEs are in-memory objects, and therefore are not suitable for holding a large number of records
iii. A CTE cannot be indexed
iv. It is calculated only once, at the point where it is declared
b. Temporary Tables:
i. Local Temporary Tables:
1. When we need to temporarily hold a large number of records (e.g. millions) which need to be indexed, we should use local temporary tables (#). However, this approach puts additional pressure on the tempdb database, so we must make sure we have the DBAs' support when going in this direction.
ii. Global Temporary Tables:
1. As much as possible, we should avoid the use of global temporary tables (##), since we can easily end up in a situation in which two unrelated stored procedures, running at the same time, try to create a global temporary table with the same name.
2. However, using global temporary tables can be the only solution for stored procedures which contain dynamic SQL code, when the following scenarios occur (see the sketch after this list):
a. EXEC (@SQL) approach:
i. When a temporary table is created inside the dynamic SQL and must remain accessible in the main stored procedure code, a local temporary table will not work, since it is dropped as soon as the dynamic batch completes; in this case a global temporary table is required.
ii. When the dynamic SQL code needs to access a temporary table created in the main stored procedure, a local temporary table is sufficient, since the dynamic batch runs as a child scope of the same session.
b. sp_executesql(@SQL) approach:
i. The same scoping rules apply here: the dynamic batch runs in a child scope of the same session, so a local temporary table created inside the dynamic SQL is dropped as soon as that batch completes and cannot be read back in the main procedure code. In that scenario, a global temporary table (or a local temporary table created beforehand in the main procedure) is required.
c. The Derived Table:
i. Does not explicitly put pressure on the tempdb database
ii. It is recalculated each time it is used; therefore it has the slowest performance of all the approaches when the same derived table is used more than once in the stored procedure
iii. It cannot be indexed, resulting in decreased performance compared with the temporary table approach.
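As a minimal sketch of the dynamic SQL scenario mentioned above (the table and column names are hypothetical), a global temporary table created inside the dynamic batch survives the end of that batch and can be read back in the main procedure, whereas a local # table created there could not:

DECLARE @SQL NVARCHAR(MAX);

-- the dynamic batch creates and fills a global temporary table
SET @SQL = N'SELECT PRODUCT_ID, SUM(AMOUNT) AS TOTAL_AMOUNT
             INTO ##SALES_AGG_BY_PRODUCT
             FROM dbo.FACT_SALES
             GROUP BY PRODUCT_ID;';
EXEC (@SQL);

-- back in the main stored procedure, the global temporary table is still visible
SELECT * FROM ##SALES_AGG_BY_PRODUCT;

DROP TABLE ##SALES_AGG_BY_PRODUCT;   -- clean up explicitly to limit collision risk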
5. Use appropriate aliases in SQL queries, in order to improve code reusability between SQL statements:
a. The alias should be built from the first letter of each distinct word in the table name (e.g. REF_PRODUCT_FAMILY should have the alias RPF)
b. If two tables in the same query would end up with exactly the same alias by using the above approach, we should add additional letters, based on the business meaning of the corresponding tables, in order to differentiate them
c. When the table name contains a single business word, we should build a representative alias (e.g. REF_SUPPORT should have the alias RSUP)
6. When writing a SQL query which uses multiple tables, we should always prefix all columns with their table alias, even if the column names are unique. In this way:
a. We will always know the source table for a field
b. We will prevent situations in which adding a new field to one table causes an existing SQL query to fail due to an ambiguous column name
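A small illustration of guidelines 5 and 6, using hypothetical referential tables; every column is prefixed with its table alias, so a column later added to either table can never make the query ambiguous:

SELECT RPF.PRODUCT_FAMILY_ID,
       RPF.PRODUCT_FAMILY_NAME,
       RP.PRODUCT_ID,
       RP.PRODUCT_NAME
FROM dbo.REF_PRODUCT_FAMILY RPF
INNER JOIN dbo.REF_PRODUCT RP ON RP.PRODUCT_FAMILY_ID = RPF.PRODUCT_FAMILY_ID;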
7. We should define Primary Key constraints for all tables, at all DWH levels. In this way, we make sure that no data duplication occurs. However, you should be aware that activating Primary Key constraints on large Fact Tables results in very large clustered indexes, which must be maintained on each insert operation. Therefore, keeping Primary Key constraints on Fact Tables only really makes sense as long as you consistently use these key values when querying from the Presentation Layer.
8. Otherwise, when you use partitioned tables, a viable option is to develop a mechanism which checks uniqueness in the partition you intend to add to the main Fact Table, an approach which avoids having a large clustered index on the Fact Table. When a duplicate is detected, a custom error is thrown and the affected partition is not switched in, which allows you to keep consistent data in the Fact Table. A sketch of such a check is shown below.
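A minimal sketch of such a uniqueness check, assuming a hypothetical staging table STG_FACT_SALES aligned to partition @PARTITION_NUMBER of FACT_SALES, with the logical key (STORE_ID, INVOICE_NO):

DECLARE @PARTITION_NUMBER INT = 42;   -- hypothetical target partition

-- fail fast if the staging data contains duplicates on the logical key
IF EXISTS (SELECT STORE_ID, INVOICE_NO
           FROM dbo.STG_FACT_SALES
           GROUP BY STORE_ID, INVOICE_NO
           HAVING COUNT(*) > 1)
BEGIN
    RAISERROR ('Duplicate keys detected in staging partition; SWITCH IN aborted.', 16, 1);
    RETURN;
END;

-- metadata-only operation: attach the staging data as a partition of the fact table
ALTER TABLE dbo.STG_FACT_SALES
    SWITCH TO dbo.FACT_SALES PARTITION @PARTITION_NUMBER;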
9. We must make sure all Referential Tables contain a Non Defined value, which is always mapped to the Surrogate Key = 0 value. Additionally, we should:
a. Make sure that the Business / Natural Key associated to this Non Defined value is generic enough to never be encountered during the Referential Table's life cycle.
b. Make sure that there is a clear understanding of the difference between this Technical Non Defined value and other Non Defined values defined in the Source Application for the Referential Table.
c. Create a MetaData table which stores the Technical Non Defined value assigned to each Referential Table, since the range of values which can occur for the Business / Natural Key differs from table to table.
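As a minimal sketch (the table and column names are hypothetical, and the surrogate key is assumed to be an IDENTITY column), the Non Defined member for a product referential could be seeded like this, with the chosen Business Key also recorded in a MetaData table:

SET IDENTITY_INSERT dbo.REF_PRODUCT ON;
INSERT INTO dbo.REF_PRODUCT (PRODUCT_SK, PRODUCT_CODE, PRODUCT_NAME)
VALUES (0, '#ND#', 'Not Defined');            -- Surrogate Key = 0, generic Business Key
SET IDENTITY_INSERT dbo.REF_PRODUCT OFF;

INSERT INTO dbo.META_NOT_DEFINED_VALUES (TABLE_NAME, BUSINESS_KEY_VALUE)
VALUES ('REF_PRODUCT', '#ND#');               -- keep track of the technical value per table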
10. You should consider adding Foreign Key constraints for all tables in the Actual Data Warehouse, in order to avoid situations in which records exist in the Fact Tables without a correspondence in the Referential Tables. When such a scenario occurs, an error is thrown by SQL Server and the situation can be investigated by the BI Technical Team.
11. However, be aware of the performance overhead brought by adding Primary Key and Foreign Key constraints on the Fact Tables, since these constraints are checked for each row which is inserted/updated/deleted. Therefore, it can make more sense to keep the Foreign Key constraints activated just for the Referential Tables, which should be relatively small compared with the Fact Tables, and to ensure through the usage of the Surrogate Key = 0 member that you will never have, in the Fact Tables, a Surrogate Key without correspondence in the Referential Tables.
12. Avoid, as much as possible, the use of SQL Cursors, since they have a severe impact on performance. We should try:
a. To emulate the Cursor approach with more complex SELECT statements
b. When the above approach is not possible, to store the source SELECT in a temporary table or CTE containing an automatically generated row number, and then use a WHILE loop (see the sketch below)
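A minimal sketch of the second approach, iterating over a hypothetical REF_STORE table without a cursor (the per-store stored procedure in the comment is also hypothetical):

-- number the rows once, into a temporary table
SELECT ROW_NUMBER() OVER (ORDER BY STORE_ID) AS ROW_NO,
       STORE_ID
INTO #STORE_LIST
FROM dbo.REF_STORE;

DECLARE @ROW_NO INT = 1,
        @MAX_ROW_NO INT = (SELECT MAX(ROW_NO) FROM #STORE_LIST),
        @STORE_ID INT;

WHILE @ROW_NO <= @MAX_ROW_NO
BEGIN
    SELECT @STORE_ID = STORE_ID FROM #STORE_LIST WHERE ROW_NO = @ROW_NO;

    -- per-store processing goes here (e.g. EXEC dbo.USP_LOAD_STORE @STORE_ID;)

    SET @ROW_NO = @ROW_NO + 1;
END;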
13. Do not use spaces in SQL object names. Spaces should be replaced with '_'. In this way, we avoid having to wrap object names in brackets ([]) every time they are referenced.
14. Always schema-qualify an object when referencing it. In this way, we avoid relying on the default schema (usually dbo) and we cover the scenario in which different schemas are used within a database. As good news, starting with SQL Server 2012 we can finally assign a default schema to Windows Groups, a missing feature which previously caused a lot of headaches when using SELECT INTO without a default schema.
15. Incorporate frequently required, complicated joins and calculations into views and functions, in order to avoid, as much as possible, code duplication. However, a balance must be found between the advantage of eliminating code duplication and the performance issues brought by the use of views and functions.
16. For example, for currency conversion, let’s say we have the generic function FN_CURRENCY_CONV. Using a single conversion function greatly simplifies the conversion operations and removes unnecessary code duplication. However, it is advisable to use the function only on small numbers of records, since the function is called individually for every line in the result set. For larger result sets, consider building a generic Stored Procedure in charge of currency conversion, which performs an update on either the entire table or a subset of it, using a JOIN between the updated table and the Currency Conversion Referential. This approach results in much better performance than using a function.
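A minimal sketch of the set-based alternative (the FACT_SALES and REF_CURRENCY_RATE names and columns are hypothetical): one UPDATE joined to the conversion referential replaces the row-by-row function calls:

UPDATE F
SET F.AMOUNT_EUR = F.AMOUNT_LOCAL * R.RATE_TO_EUR
FROM dbo.FACT_SALES F
INNER JOIN dbo.REF_CURRENCY_RATE R
        ON R.CURRENCY_CODE = F.CURRENCY_CODE
       AND R.RATE_DATE = F.INVOICE_DATE
WHERE F.INVOICE_DATE >= '20240101';   -- optionally restrict the update to a subset of the table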
17. When comparing nullable fields, always use syntax such as ISNULL(Col1, '') <> ISNULL(Col2, ''). This is necessary because any comparison with a NULL value evaluates to UNKNOWN, which is effectively treated as FALSE in a WHERE clause. Alternatively, the COALESCE function can be used.
18. Limit the use of dynamic SQL statements, since they are usually slower because the execution plan cannot be precompiled. However, that is just the theory; in practice you will find quite a few scenarios in which they cannot be avoided. One such scenario occurs when using the Partitioning Mechanism, where you recover the partitioning criteria using a JOIN with a MetaData table: you end up building a @PARTITION_LIST and then you really have to use dynamic SQL in order to benefit from the partitioning advantage.
19. When using a BEGIN TRAN statement, always check @@ERROR immediately after data manipulation statements (INSERT/UPDATE/DELETE), in order to trigger a ROLLBACK TRAN when needed.
20. Implement custom error handling based on the TRY/CATCH mechanism or @@ERROR in every stored procedure and function.
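A minimal TRY/CATCH template combining guidelines 19 and 20 (the statement inside the transaction is just a placeholder):

BEGIN TRY
    BEGIN TRAN;

    -- data manipulation statements go here
    UPDATE dbo.FACT_SALES SET AMOUNT = AMOUNT WHERE 1 = 0;   -- placeholder

    COMMIT TRAN;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRAN;

    -- re-raise with context so the caller / scheduler sees the failure
    DECLARE @ERR_MSG NVARCHAR(2048) = ERROR_MESSAGE();
    RAISERROR ('Load failed: %s', 16, 1, @ERR_MSG);
END CATCH;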
21. When performing single-row inserts on tables containing Identity columns, use the OUTPUT clause to return the Identity value generated for that row, instead of using SCOPE_IDENTITY()
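A small sketch of the OUTPUT clause approach (hypothetical AUDIT_LOG table with an IDENTITY column LOG_ID):

DECLARE @NEW_IDS TABLE (LOG_ID INT);

INSERT INTO dbo.AUDIT_LOG (EVENT_NAME, EVENT_DATE)
OUTPUT INSERTED.LOG_ID INTO @NEW_IDS
VALUES ('PARTITION_SWITCH', GETDATE());

SELECT LOG_ID FROM @NEW_IDS;   -- the identity value generated by the insert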
22. Use minimally logged operations, like BULK INSERT and SELECT INTO, when you need to improve insert performance, due to the reduction of the logging overhead. However, you should be aware of the risks described below when using SELECT INTO extensively. Whenever possible, try to use INSERT INTO ... WITH (TABLOCK) instead.
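A minimal sketch of such an insert (hypothetical tables; minimal logging still depends on the usual prerequisites, such as the destination being a heap or an empty clustered index and the appropriate recovery model):

INSERT INTO dbo.FACT_SALES_HIST WITH (TABLOCK)   -- table lock enables minimal logging
       (INVOICE_NO, INVOICE_DATE, STORE_ID, AMOUNT)
SELECT INVOICE_NO, INVOICE_DATE, STORE_ID, AMOUNT
FROM dbo.FACT_SALES
WHERE INVOICE_DATE < '20230101';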
23. On partitioned tables, when you need to delete the data associated to one partition, the DELETE statement (a fully logged operation) can be replaced by switching the partition out to an identical, empty intermediary table and then dropping that intermediary table. Since the switch is a metadata-only operation, it is almost instant compared to a DELETE statement.
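A minimal sketch of the switch-out pattern (FACT_SALES and the intermediary table are hypothetical; the intermediary table must have an identical structure and sit on the same filegroup as the source partition):

-- "delete" partition 42 of the fact table without a logged DELETE
ALTER TABLE dbo.FACT_SALES
    SWITCH PARTITION 42 TO dbo.FACT_SALES_SWITCH_OUT;   -- metadata-only operation

DROP TABLE dbo.FACT_SALES_SWITCH_OUT;                   -- discard the switched-out data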
24. Avoid very long queries involving complex subqueries and/or a lot of JOIN operations, since they will almost certainly perform poorly. Instead, consider splitting the initial select into smaller chunks, and use these smaller result sets to populate a temporary table, a Common Table Expression (CTE) or a table variable. In this way, you end up with a much simpler final select, for which the SQL Server Query Processing Engine can construct a much more robust execution plan, especially if you've added the necessary indexes on the temporary tables you have created.
25. You should try to limit as much as possible the number of records from each table involved in JOIN operations. For example, consider the following situation:
· Table ORDERS: ORDER_NO; ORDER_DATE; STORE_ID; ORDER_QTY (PK: STORE_ID, ORDER_NO)
· Table SALES: INVOICE_NO; INVOICE_DATE; STORE_ID; ORDER_NO; AMOUNT
There are two main ways to find out the ORDER_QTY corresponding to each invoiced order, for STORE_ID = 1:
SELECT *
FROM SALES S
INNER JOIN ORDERS O ON S.STORE_ID = O.STORE_ID
AND S.ORDER_NO = O.ORDER_NO
WHERE S.STORE_ID = 1
or
SELECT *
FROM SALES S
INNER JOIN ORDERS O ON O.STORE_ID = 1
AND S.STORE_ID = O.STORE_ID
AND S.ORDER_NO = O.ORDER_NO
From a performance point of view, the second select will be faster, since it filters the data directly in the JOIN operation. Because the WHERE clause is logically applied only after all JOIN operations are resolved, placing the filter directly in the JOIN condition greatly reduces the initial result set from the ORDERS table before it is joined with the SALES table.
26. Using SQL functions on large result sets is a performance killer. Scalar functions are slow since they are called individually for each line of the result set. You should try to replace them, as much as possible, with JOIN operations using subqueries. For example, imagine that you have to retrieve the maximum value from a given list of columns. You have two main ways of doing this:
a. Create a function which performs this calculation
b. Create a subquery using UNPIVOT operator
The second option, based on a subquery with UNPIVOT, is several times faster than the one using a function (a sketch is shown below).
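A minimal sketch of the UNPIVOT approach, assuming a hypothetical FACT_BUDGET table with three forecast columns whose per-row maximum is needed:

SELECT U.BUDGET_ID,
       MAX(U.FORECAST_VALUE) AS MAX_FORECAST
FROM (SELECT BUDGET_ID, FORECAST_Q1, FORECAST_Q2, FORECAST_Q3
      FROM dbo.FACT_BUDGET) P
UNPIVOT (FORECAST_VALUE FOR FORECAST_PERIOD IN (FORECAST_Q1, FORECAST_Q2, FORECAST_Q3)) AS U
GROUP BY U.BUDGET_ID;   -- one MAX per row, without calling a scalar function per line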
27. Avoid the usage of Global Temporary Tables. As much as possible, you should limit yourself to Local Temporary Tables, which are not visible outside the current SPID. When using Global Temporary Tables, you can end up pretty quickly having collisions between various Stored Procedures, especially when generic table names are used (e.g. ##TEMP). Instead, you should always consider using unique object names related to the business scope of that stored procedure. Also, you should be very careful with the usage of Global Temporary Tables when the parent Stored Procedure can be called in parallel. If such a scenario occurs, you should add a SPID column to each Global Temporary Table, which will allow you to separate the individual result sets belonging to each SPID.
28. Avoid, as much as possible, the extensive use of SELECT INTO. Back in the days of SQL Server 2000 and 2005, if you wanted to perform minimally logged insert operations, SELECT INTO was the only way to go. Starting with SQL Server 2008, you can also perform minimally logged inserts directly by using the INSERT statement with the TABLOCK hint, as long as the destination table has no indexes, or just one clustered index and no non-clustered indexes.
The BIG problem I’ve noticed while using SELECT INTO is that it generates exclusive locks on the sysobjects system table, which prevents that table from being queried. This situation is potentially VERY dangerous, since there are a lot of user actions which lead SQL Server to query the sysobjects system table; one such example is accessing the table list from SQL Server Management Studio. As long as there is an ongoing SELECT INTO, this operation cannot be performed, which will prevent ALL developers from accessing the list of tables in SQL Server Management Studio. And this is just the tip of the iceberg.
Now imagine that you have a Stored Procedure which regularly takes around 30 minutes and includes a 1-minute SELECT INTO. If, for error handling reasons, you decide to put the main stored procedure code in a transaction, then you will have the very unpleasant surprise of discovering that the sysobjects system table is locked for the entire 30 minutes. This can generate a lot of unforeseen problems, since these system tables are used internally by the SQL Server Engine. Instead of using SELECT INTO, I would strongly recommend using minimally logged INSERT operations, which don't generate such behavior and are almost as fast as SELECT INTO. If you were previously using SELECT INTO to generate a clone of the source table very quickly, I suggest developing a reusable mechanism to clone a table structure, rather than going for this quick-fix method, which could give you a lot of headaches.
29. Avoid, as much as possible, DDL operations inside transactions. DDL operations (DROP TABLE, CREATE TABLE, ALTER TABLE, SELECT INTO, TRUNCATE TABLE, PARTITION SWITCH OUT/IN) generate locks on system tables, and therefore you are VERY interested in minimizing the amount of time these locks are held, since, as discussed earlier, they can have very unpleasant effects, like deadlocks. Some of these DDL operations can be replaced by DML (e.g. SELECT INTO vs. INSERT INTO ... WITH (TABLOCK)), but when they cannot be avoided (e.g. PARTITION SWITCH OUT/IN), you should structure your transaction so that it isolates just the DDL operations, which are typically very fast, and avoid mixing them with complex, long-running DML operations. In this way you will minimize the total duration during which locks are held by SQL Server on system tables.
If deadlocks still occur even after minimizing the duration of the transactions in which DDL operations are performed, you should seriously consider implementing a custom SQL Waiting Queue, which serializes all transactions containing DDL operations for a given database. You will lose some performance, but with the advantage of removing the scenarios which lead to deadlocks as a result of locking on system tables.
30. Build a generic mechanism to store and transfer the security information for all SQL Server user objects. Sometimes you'll really need to use the DROP / SELECT INTO approach in order to benefit from the best possible performance of minimally logged operations. However, each time you do this, you'll lose all accesses previously granted on those objects. This is why it is necessary to build a mechanism which will:
a. Retrieve all security for a given list of objects and store it in a metadata table
b. Proceed with DROP / SELECT INTO
c. Based on the security retrieved at step a, re-apply the GRANTs on the new table
Additionally, such a mechanism can be used to grant, on a new table, the same accesses as on an already existing table.
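A minimal sketch of step a, reading the existing grants from the system catalog into a hypothetical metadata table (the replay in step c can then be built by concatenating GRANT statements from these rows):

INSERT INTO dbo.META_OBJECT_PERMISSIONS (OBJECT_NAME, PRINCIPAL_NAME, PERMISSION_NAME, STATE_DESC)
SELECT OBJECT_SCHEMA_NAME(P.major_id) + '.' + OBJECT_NAME(P.major_id),
       DP.name,
       P.permission_name,
       P.state_desc
FROM sys.database_permissions P
INNER JOIN sys.database_principals DP ON DP.principal_id = P.grantee_principal_id
WHERE P.class = 1                                   -- object-level permissions only
  AND P.major_id = OBJECT_ID('dbo.FACT_SALES');     -- hypothetical table about to be dropped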
At the end of this first part of the article, I would underline once more that you shouldn’t take these guidelines as the Holy Bible, since they have been compiled based on my own experience (and people can make mistakes) and have been validated only in a certain context. You should treat these guidelines as recommendations, filter them based on your own experience, and adapt them to your company's specific Data Warehouse.
And now, back to you: how much do you agree or disagree with the above Data Warehouse Coding Guidelines? What would you add to this list?