In the first part of the article we talked quite a bit about Data Warehouse Coding Guidelines. We all go through a love-hate relationship with rules in general, and Coding Guidelines are no exception. But the whole point is that they are really necessary if you want to have a consistent Data Warehouse. Having consistent coding rules is just part of the solution, since, in order to ensure consistency, elements like ETL (in our case SSIS) development rules, Source Control and an overall BI Department Development Strategy come into play. Always remember that a chain is only as strong as its weakest link, and Data Warehouses are no exception. And now, let's go into detail about each one of the components above:
SSIS DEVELOPMENT
1. Implement extensive usage of Configuration Files and Tables. As long as you have different environments to manage, one of the worst ideas is to have package connection strings contain hard-coded values, since this involves manual modifications before each deployment to a new environment. Instead, you should take the following approach:
a. Use a configuration file or an environment variable which contains the connection string to the server holding the SSIS Configuration table for that environment. This configuration will be used when you launch the SSIS package.
b. In order to initialize all connection managers declared at package level, use the SSIS configuration table, which contains the values corresponding to that environment. This way, each time we need to store the value of a connection manager or variable, we just need to add a new row to this table. The SSIS configuration table can hold values for different SSIS packages, which greatly improves the flexibility of the solution.
c. You can use Package Configurations in order to pass the connection string of the SSIS Configuration table down to child packages. This way you will be able to access the SSIS configuration table from any child package. However, a BIG WARNING here:
i. No matter the order chosen in the Package Configurations, the values from the SSIS configuration table are ALWAYS read before the Parent Package Variables.
ii. This will cause you BIG problems when you change environments. Let's assume that you've just migrated a development on a child package from the DEV to the PROD environment. When package execution starts, syntactic validation is performed first and the Package Configurations are read afterwards, which means that the SSIS configuration table will be read first, using a connection string which still points to the DEV environment. Only afterwards is the Connection String for the SSIS configuration table read from the parent package, but at that moment it's already too late, since all the variables have already been initialized from the DEV environment. And it's definitely a BAD thing to run, on the PROD environment, packages which point to the DEV environment.
In order to fix this issue, you'll have to:
· Remove the usage of the SSIS configuration table from the Package Configurations and keep just the Parent Package inheritance of the Connection String for the SSIS table.
· Add, as the first step in the package, an SQL task which initializes the variables from the SSIS configuration table, using what is now the correct connection string.
d. As long as you use Configuration files and tables, make sure that you set Delay Validation = True for all Connection Managers and all tasks which use variables initialized from Configuration. This way you prevent validation errors during package startup, when the variables still hold old values because the new ones haven't been read yet. Delay Validation = True means that you delay the task's validation until its actual execution, by which time the variables should already hold correct values.
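For reference, a SQL Server configuration table uses the layout below. This is only a sketch, based on the default table that the SSIS Package Configurations wizard generates; the server name DEVSRV01, the StagingDB database and the DWH_Load filter are placeholder names:

```sql
-- Default layout of the configuration table SSIS generates for
-- SQL Server configurations; adapt the names to your own conventions.
CREATE TABLE dbo.[SSIS Configurations] (
    ConfigurationFilter NVARCHAR(255) NOT NULL, -- groups rows per package/environment
    ConfiguredValue     NVARCHAR(255) NULL,     -- the actual value (e.g. a connection string)
    PackagePath         NVARCHAR(255) NOT NULL, -- the package property the value maps to
    ConfiguredValueType NVARCHAR(20)  NOT NULL  -- SSIS data type of that property
);

-- One row per configured property; adding a new variable or
-- connection manager is just a matter of adding a new row.
INSERT INTO dbo.[SSIS Configurations]
VALUES (N'DWH_Load',
        N'Data Source=DEVSRV01;Initial Catalog=StagingDB;Integrated Security=SSPI;',
        N'\Package.Connections[StagingDB].Properties[ConnectionString]',
        N'String');
```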
2. Use distinct Windows Accounts for each SQL Server Agent installation in order to avoid security issues. In the topic above, referring to the wide use of SSIS Configuration files and tables, one of the biggest dangers is to end up in a scenario in which you're running packages on a different environment than the one expected. Since such problems can always occur, you're certainly better off having the package fail right from the beginning when it tries to run queries against a server other than the one it's supposed to.
As long as you implement a strategy where you have individual Windows Accounts for each of the DEV, INT, UAT and PROD environments, and each account has access to just one environment, you will get exactly this behavior. Additionally, this will ensure that a developer who runs a job from the DEV environment won't be able to run anything on another environment.
SOURCE CONTROL
1. Source Control everything. Using Source Control with the SQL Server Database Engine has never been a walk in the park, mainly due to the lack of integration between the development environment (SQL Server Management Studio) and the Source Control solution of choice (Visual SourceSafe, Team Foundation Server, Subversion, etc.). Which doesn't mean that you don't have to do it. For every development process it is essential to have versions for each evolution of a given object (e.g. Stored Procedure, View). In this way, you will be able to:
a. Compare individual versions of the same object, and understand the changes which occurred between these versions
b. Recover rapidly the latest stable version of that object, without DBA assistance
Without a specialized tool, synchronizing the Source Control solution with the database usually involves a lot of manual operations. New tools, like Redgate's SQL Source Control, have now appeared which are designed specifically to increase the productivity of such operations, by allowing you to do the following right from SQL Server Management Studio:
· Rapidly script entire databases and put them under Source Control
· Visually detect when the database version of an object is out of sync with Source Control
Moreover, since both SSIS and SSAS use BIDS for development, which is essentially a cut-down version of Visual Studio, you benefit from Source Control integration right from the development environment. I strongly believe that implementing such solutions is absolutely necessary in order to have a healthy development environment.
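Even without such a tool, a first step towards source-controlling the database is extracting the current definitions of your programmable objects from the catalog views and committing the output. A minimal sketch, covering only stored procedures and views:

```sql
-- Pull the current definition of every stored procedure and view,
-- ready to be dumped into per-object files and committed to Source Control.
SELECT o.type_desc,
       SCHEMA_NAME(o.schema_id) + '.' + o.name AS object_name,
       m.definition
FROM sys.sql_modules AS m
JOIN sys.objects     AS o ON o.object_id = m.object_id
WHERE o.type IN ('P', 'V')   -- P = stored procedure, V = view
ORDER BY o.type_desc, object_name;
```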
DEVELOPMENT STRATEGY
1. Create a Business Intelligence Department Wiki. This is an essential step in increasing the efficiency of the development work performed inside the department by:
a. Providing a single point of entry for all technical searches. Whether we are talking about Coding Conventions, Best Practices, Source Control Strategy, Environments Strategy, Deployment Strategy or Development Notes on each modified object, it should all be found here. Even Projects should be present as a chapter here, even if sometimes they include just a short description and the most up-to-date links to the SharePoint location containing the documentation.
b. Regarding the Deployment Strategy, here you'll have the complete migration plan and migration scripts for every migration performed from DEV to any subsequent environment, up to the PROD environment.
c. Regarding the Development Notes, here you'll have detailed documentation about each evolution (Source Control version) of a given object. In this way, you will avoid adding excessive comments to the modified SQL objects (e.g. Stored Procedures), which need to contain just the absolute minimum of information plus the link to the Wiki location describing the modifications in detail.
2. Create an Automated Deployment Mechanism for all developments. In a real-life Business Intelligence Department, you'll always have a lot of environments to manage. Normally, you should Source Control just the DEV environment, group the developments into builds, and proceed with the deployment of each build to the next environment (INTEGRATION, UAT and PROD). As long as you keep a trace of which build has been deployed on which environment, you will be able to determine which Source Control version of a given object has been deployed on a given environment. Such a strategy is very important in order to be able to quickly roll back to the latest working version previously deployed on that environment.
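Such a trace can be as simple as one table recording which build landed on which environment. A sketch, with all table and column names purely illustrative:

```sql
-- One row per deployment of a build to an environment.
CREATE TABLE dbo.DeploymentLog (
    DeploymentId INT IDENTITY(1,1) PRIMARY KEY,
    BuildLabel   NVARCHAR(50) NOT NULL,  -- Source Control label/changeset of the build
    Environment  NVARCHAR(20) NOT NULL,  -- INTEGRATION, UAT or PROD
    DeployedBy   SYSNAME      NOT NULL DEFAULT SUSER_SNAME(),
    DeployedAt   DATETIME     NOT NULL DEFAULT GETDATE()
);

-- Which build is currently live on each environment?
SELECT Environment, BuildLabel, DeployedAt
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY Environment
                                 ORDER BY DeployedAt DESC) AS rn
    FROM dbo.DeploymentLog
) AS d
WHERE rn = 1;
```

Since each build label maps back to a set of Source Control versions, this single table is enough to answer "which version of object X is on UAT right now?" and to drive a rollback.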
3. Define a Harmonization Development Strategy. Imagine that you have multiple projects in development simultaneously. Sooner or later you'll end up in a situation where the same object has to be modified by both projects. This creates problems regarding the starting version of that object to be used by each project. Should you use the latest version which reached the PROD environment, or the latest version of the object, even if it hasn't been validated yet? Moreover, you'll have to establish how many DEV environments you intend to maintain:
a. If you have large projects developed simultaneously, how will you handle different versions of the same object on a single DEV environment? You can keep different versions of that object in Source Control, one per project, but you cannot have more than one version of the object deployed on the SQL Server Database Engine. If you decide to use different names in order to handle this scenario, you'll lose the advantage of centralized Source Control, which will now keep a separate history for each individually named object.
b. If you decide to create one DEV environment for each large new project, then you will have a completely separate Source Control installation, and you will:
i. Be unable to compare the differences for the same object across different Source Control installations
ii. Have no way of knowing when the other project team is modifying their own version of the object
iii. No longer have a single version of the truth, which requires all versions of an object to live under a single Source Control installation
c. You'll always have Technical Support activity modifying objects in order to correct bugs on the PROD environment. These corrections will have to be incorporated ASAP into the ongoing developments.
4. Define a public MAP of the roles of each team member. In this way, you will avoid philosophical questions like "How was it possible for such a thing to happen?". As long as you don't have a list of responsibilities that is clear to everyone, you always risk a situation where each team member thought that someone else would do a certain activity, and you end up with no one actually having done it.
And this is where pseudo-management separates from real management. In pseudo-management you just want to have your back covered in case something goes wrong. In real management you want to prevent problems from occurring in the first place, which means that you're more focused on understanding why a certain problem appeared, and how it can be prevented in the future, rather than just finding the proverbial scapegoat.
Overall, this means that you should identify the key roles inside your Development Team and assign people to them. Besides developers, you should have the following roles defined:
a. Technical Architects, who define the overall architecture of the Data Warehouse and approve the Development Solutions for the various projects, presented to them by the Technical Project Coordinator
b. Technical Project Coordinator, usually a Senior Developer, who is in charge of defining a consistent technical solution for the project. His responsibility mainly covers the macro-vision of the project's technical solution, and he assists the developers in defining the micro-vision of each major component.
c. Code Reviewers, usually Senior Developers, who make sure that the code produced complies with the rules defined at departmental level
d. Data Warehouse Administrators, in charge of:
i. Performing the Daily Verifications, monitoring the performance of the Data Warehouse and investigating the problems that occur. This is where the Data Warehouse Administrative Dashboard proves quite handy.
ii. Define the initial resource requests, and set up and parameterize the different environments used by the BI department
iii. Monitor the resource usage of each environment, define thresholds for each monitored resource and request new resources once these thresholds are reached
e. Migration Coordinators, in charge of validating the migration plan created by the developers and performing the actual migration to the various environments. This is usually a role that can be handled by a Data Warehouse Administrator with the assistance of the Technical Project Coordinator.
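The Daily Verifications mentioned under the Data Warehouse Administrator role typically start with queries like the one below, against the standard SQL Server Agent history tables in msdb. A sketch only; real dashboards usually add duration and trend checks on top:

```sql
-- Failed SQL Server Agent job steps from the last day: a typical
-- building block of the Daily Verifications / Administrative Dashboard.
SELECT j.name AS job_name,
       h.step_name,
       h.run_date,       -- stored as an integer, yyyymmdd
       h.run_time,
       h.message
FROM msdb.dbo.sysjobhistory AS h
JOIN msdb.dbo.sysjobs       AS j ON j.job_id = h.job_id
WHERE h.run_status = 0    -- 0 = Failed
  AND h.run_date >= CONVERT(INT,
        CONVERT(CHAR(8), DATEADD(DAY, -1, GETDATE()), 112))
ORDER BY h.run_date DESC, h.run_time DESC;
```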
As a conclusion to this article, I would say that there are many factors which make or break a Data Warehouse. How well the activity of the Business Intelligence department is organized, and how clear the responsibilities of each team member are, is very important. The quality and maintainability of the code being written, based on clearly defined guidelines, is another. But the most important thing of all is to actively listen to your customers, and to keep adapting the Data Warehouse to match their business needs, even if sometimes this sends you back to the drawing board. Don't be afraid to take on a challenge just because what you have right now doesn't cut it. Instead, focus on building your Data Warehouse from reusable stand-alone components which can be interconnected, much like Lego bricks. And remember how much fun you had with Lego as a kid!
And now, back to you: how clearly are the roles of the development team defined inside your company's Business Intelligence department? Do you use Source Control for your Data Warehouse developments?