Building a Generic Import Solution to feed your Data Warehouse is an essential step in providing a solid foundation on which current and future Business Intelligence projects are built. Such a component joins other generic components like the Business Rule Engine and the Partitioning Mechanism, and allows the Development phase of a Business Intelligence project to focus on the specificities of each project, rather than spending massive amounts of time, every time, re-inventing the wheel.
Developing a generic component will always take more time than implementing a quick-fix approach, but, in the end, such a component offers a much higher Return on Investment by:
· Saving a lot of time in future developments
· Streamlining the whole Data Warehouse architecture
The only catch of this approach based on Generic Components is that it involves a Data Warehouse Architecture effort which falls outside the scope of any regular Data Warehouse business project. Such a project is always business driven, and users will never ask for things like a Partitioning Mechanism, for example. That means it is the direct responsibility of the Business Intelligence department to take such initiatives, and to get the support of upper management in order to build the technical foundation which will form the future backbone of the Data Warehouse. It also means that, in the short and mid-term, you won't be able to deliver as many business projects as the business owners would like, which can make the Business Intelligence department somewhat unpopular.
But I'm pretty sure that having a solid technical foundation will greatly benefit all Data Warehouse projects in the mid and long term. As long as you are able to explain to the business users your intentions and the upcoming advantages, I think you'll have a win here. And this is necessary because the Data Warehouse Architecture scope stretches far beyond a regular Data Warehouse business project, which means that, in the end, a Data Warehouse is more than the sum of its Data Marts.
Now let's get back to our horses (more on the link between horses and ETL in the Choosing the right ETL article) and get some insights on the topic of Import Solutions. The first thing we should talk about is what a standard import solution looks like in a regular Data Warehouse. My own experience is with Microsoft SQL Server Integration Services, but I'm pretty sure that the problems presented here are very similar in other ETL systems.
Typically, when you have to import some data, you will create an SSIS package for each distinct Data Source and Business Domain for which you need to import data. Within each SSIS package, you will have a Data Flow component for each table you intend to import. So far so good, but let's think a bit about the advantages and disadvantages of such an approach:
1. On the BRIGHT side:
a. From a performance point of view, the typical OleDb Source and Destination are indeed very fast. Honestly, I haven't found many ways to make the import significantly faster than this. Using SQL syntax based on OPENQUERY does offer very good performance, partially because the whole statement is resolved on the destination server, but it is just a similar level of performance, nothing more.
b. You'll have the advantage of parallelism, which means you'll be able to import several tables at the same time, in fact as many as the distinct Data Flows you have created. There is a warning here, though: too much of a good thing can be bad, and this is such an example. Having too many parallel Data Flows running can over-stretch the server resources, especially on the storage and CPU side, and you risk ending up with performance problems.
2. On the DARK side, there are still the following issues:
a. On the Teamwork side, as long as we are talking about individual packages, you cannot have more than one programmer working on the same package. This is a severe drawback on the Efficiency side, especially when you have to import dozens of tables per package, since they all belong to a single Business Domain.
b. It's very difficult, if not impossible, to build a Generic Import Solution in SSIS using the regular OleDb Source and Destination: even if it is possible to change the Source and Destination Table and Column Names by using the Expression Editor, it is not possible to change the mapping between the Source and Destination columns, since it is hard-coded at Design time. Personally, I would have been more than happy to be able to do this programmatically, using these generic components, but so far, even in SQL Server 2008 R2, it cannot be done.
c. Having a Streamlined Architecture and Specific Import Packages for each Data Source and Business Domain are two VERY incompatible approaches. In a regular Data Warehouse you have dozens of different Source Applications, and hundreds or thousands of tables to import. Creating an SSIS package for each Data Source and Business Domain can be a solution in the short term, but in the mid and long term it will lead to a maintenance and administrative nightmare every time you have to perform a modification.
The best solution that I’ve found for solving such issues has been to design a Hybrid solution, which combines the Flexibility of an SQL Solution with the Parallelism and Error Handling advantages offered by an SSIS solution. It's based on the following logic:
On the SQL Server Side:
- Create a generic table which will be used to hold parameterized SQL Server statements (ex. IMPORT_PAREMETRIZED_STATEMENTS). These import statements will be associated with the concepts of APPLICATION and REGION, where APPLICATION refers to the Source Application from which the data is being imported and REGION refers to the Geographical Region, since you may have different installations of the same application in different regions (ex. one version of the ERP installed for the US region and another for EMEA).
This table will contain import statements based on either the OPENQUERY or the OPENROWSET approach:
i. You should use OPENQUERY whenever it is possible to have a linked server to the source. Running the query directly on the remote server offers a very good level of performance while pulling data, since the records are extracted in batches rather than row by row. The best example of using OPENQUERY is when extracting data from a SQL Server or Oracle server. However, the SQL syntax must be adapted to match the one of the Source Server
ii. You should use OPENROWSET whenever a linked server isn't possible. Sometimes you have to import data from Text, Excel or XML files, and sometimes the file name changes on a daily basis. In both cases OPENROWSET does a very good job.
Here is how such a generic import statement might look, based on the OPENQUERY approach:
SELECT * INTO @DEST_DATABASE.@DEST_SCHEMA.TMP_@DEST_TABLE_NAME
FROM OPENQUERY([@SOURCE_SERVER], 'SELECT * FROM @SOURCE_DATABASE.@SOURCE_SCHEMA.INVOICE_LINE WHERE ACCOUNTING_DT > @START_DT')
As you can see from the above example, any parameter which is environment specific is represented in the table using the @PARAMETER_NAME convention. Some of these parameters are statement specific and will be resolved based on other fields of the IMPORT_PAREMETRIZED_STATEMENTS table, since they are specific to each import statement (ex. @DEST_TABLE_NAME). Other parameters are generic, apply to a whole APPLICATION and REGION, and can be found in the IMPORT_PARAMETERS table.
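For the cases where no linked server is available (the OPENROWSET scenario from point ii. above), a similar parameterized statement could be stored for a flat file. This is only a minimal sketch: the BULK provider, the @SOURCE_FILE_PATH parameter, the file name and the format file are my own illustrative assumptions, not part of the mechanism described so far.

SELECT * INTO @DEST_DATABASE.@DEST_SCHEMA.TMP_@DEST_TABLE_NAME
FROM OPENROWSET(BULK '@SOURCE_FILE_PATH\INVOICE_LINE.csv',
     FORMATFILE = '@SOURCE_FILE_PATH\INVOICE_LINE.fmt') AS SRC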
The table will have the following structure:
· SQL_ID - INT, Auto Increment (Surrogate Key)
· SOURCE_APPLICATION - VARCHAR (PK)
· REGION - VARCHAR (PK)
· DEST_TABLE_NAME - VARCHAR (PK)
· SQL - VARCHAR(MAX) - contains the parameterized SQL statement
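As a minimal DDL sketch of the structure above (the data types and lengths are my assumptions, to be adapted to your own standards):

CREATE TABLE dbo.IMPORT_PAREMETRIZED_STATEMENTS (
    SQL_ID             INT IDENTITY(1,1) NOT NULL,   -- surrogate key
    SOURCE_APPLICATION VARCHAR(50)       NOT NULL,
    REGION             VARCHAR(50)       NOT NULL,
    DEST_TABLE_NAME    VARCHAR(128)      NOT NULL,
    [SQL]              VARCHAR(MAX)      NOT NULL,   -- the parameterized SQL statement
    CONSTRAINT PK_IMPORT_PAREMETRIZED_STATEMENTS
        PRIMARY KEY (SOURCE_APPLICATION, REGION, DEST_TABLE_NAME)
)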
- Create a generic parameter table (ex. IMPORT_PARAMETERS), which will contain all the general parameters defined for a specific APPLICATION and REGION.
The table will have the following structure:
· SOURCE_APPLICATION - VARCHAR (PK) - ex. MAIN_ERP
· REGION - VARCHAR (PK) - ex. EMEA
· PARAMETER_NAME - VARCHAR (PK) - ex. @SOURCE_SERVER
· PARAMETER_VALUE - VARCHAR - ex. MAIN_ERP_LINK_SRV
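A corresponding DDL sketch, together with the sample row from the list above (again, the data types and lengths are assumptions):

CREATE TABLE dbo.IMPORT_PARAMETERS (
    SOURCE_APPLICATION VARCHAR(50)   NOT NULL,
    REGION             VARCHAR(50)   NOT NULL,
    PARAMETER_NAME     VARCHAR(128)  NOT NULL,
    PARAMETER_VALUE    VARCHAR(4000) NOT NULL,
    CONSTRAINT PK_IMPORT_PARAMETERS
        PRIMARY KEY (SOURCE_APPLICATION, REGION, PARAMETER_NAME)
)

INSERT INTO dbo.IMPORT_PARAMETERS (SOURCE_APPLICATION, REGION, PARAMETER_NAME, PARAMETER_VALUE)
VALUES ('MAIN_ERP', 'EMEA', '@SOURCE_SERVER', 'MAIN_ERP_LINK_SRV')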
- Create the tables related to the Import Set concept. Basically, the Import Set allows grouping different SOURCE_APPLICATION and REGION combinations so that they can be executed under a single run of the import mechanism:
· IMPORT_SETS (IMPORT_SET_ID (SK), IMPORT_SET_NAME(PK))
· IMPORT_SOURCE_APPLICATION_REGION (SOURCE_APPLICATION_REGION_ID, SOURCE_APPLICATION, REGION)
· LINK_IMPORT_SETS_SOURCE_APPLICATION_REGION (IMPORT_SET_ID, SOURCE_APPLICATION_REGION_ID)
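A possible DDL sketch for these three tables (data types are assumptions; foreign keys are omitted for brevity):

CREATE TABLE dbo.IMPORT_SETS (
    IMPORT_SET_ID   INT IDENTITY(1,1) NOT NULL,   -- surrogate key
    IMPORT_SET_NAME VARCHAR(100)      NOT NULL,
    CONSTRAINT PK_IMPORT_SETS PRIMARY KEY (IMPORT_SET_NAME)
)

CREATE TABLE dbo.IMPORT_SOURCE_APPLICATION_REGION (
    SOURCE_APPLICATION_REGION_ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    SOURCE_APPLICATION           VARCHAR(50)       NOT NULL,
    REGION                       VARCHAR(50)       NOT NULL
)

CREATE TABLE dbo.LINK_IMPORT_SETS_SOURCE_APPLICATION_REGION (
    IMPORT_SET_ID                INT NOT NULL,
    SOURCE_APPLICATION_REGION_ID INT NOT NULL,
    CONSTRAINT PK_LINK_IMPORT_SETS_SOURCE_APPLICATION_REGION
        PRIMARY KEY (IMPORT_SET_ID, SOURCE_APPLICATION_REGION_ID)
)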
- Create the table containing the final import statements, to be executed (ex. IMPORT_FINAL_STATEMENTS).
The table will have the following structure:
· SQL_ID (FK)
· SQL_TO_EXECUTE
· SUCCESS_SQL
· FAIL_SQL
· HAS_BEEN_PROCESSED
· THREAD_NO
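A minimal DDL sketch of this table could be the following (the data types, in particular BIT for the processed flag, are my assumptions):

CREATE TABLE dbo.IMPORT_FINAL_STATEMENTS (
    SQL_ID             INT          NOT NULL,             -- FK to IMPORT_PAREMETRIZED_STATEMENTS
    SQL_TO_EXECUTE     VARCHAR(MAX) NOT NULL,             -- fully resolved import statement
    SUCCESS_SQL        VARCHAR(MAX) NULL,                 -- executed if the import succeeds
    FAIL_SQL           VARCHAR(MAX) NULL,                 -- executed if the import fails
    HAS_BEEN_PROCESSED BIT          NOT NULL DEFAULT (0),
    THREAD_NO          INT          NULL                  -- filled in by the thread claiming the row
)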
- In order to populate this table we will create a stored procedure (ex. PRC_GENERATE_FINAL_IMPORT_STATEMENTS), which receives the @IMPORT_SET_NAME as a parameter. The procedure performs the following operations (a simplified sketch follows the list):
- By using the above mentioned table structure, starting from the Import Set, we will be able to extract all associated statements from IMPORT_PAREMETRIZED_STATEMENTS
- For the list of parameters defined in the IMPORT_PARAMETERS table for each specific SOURCE_APPLICATION and REGION, we will replace each parameter in the SQL field of each record from the IMPORT_PAREMETRIZED_STATEMENTS table. The result will be used to fill in the SQL_TO_EXECUTE field of the IMPORT_FINAL_STATEMENTS table
- Based on the desired Error Handling logic, the SUCCESS_SQL and FAIL_SQL fields will be filled in. For example, if you have implemented something like Table Partitioning, described in detail in The Name of the Game - Table Partition article series, you could go for the following approach:
i. SUCCESS_SQL - if the content of SQL_TO_EXECUTE is executed successfully, you may proceed with the refresh of the final import table, based on the content of the generated TMP table, by using the SWITCH OUT / IN mechanism, which is very fast.
ii. FAIL_SQL - if the execution of SQL_TO_EXECUTE fails, you may decide to drop the TMP table in order to free up disk space, since it contains incorrect data anyway
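To make the steps above more concrete, here is a simplified sketch of what PRC_GENERATE_FINAL_IMPORT_STATEMENTS could look like. It only covers the generic parameter replacement and the insert into the final table; the statement-specific parameters and the SUCCESS_SQL / FAIL_SQL generation follow the same pattern and depend on your partitioning and error-handling choices:

CREATE PROCEDURE dbo.PRC_GENERATE_FINAL_IMPORT_STATEMENTS
    @IMPORT_SET_NAME VARCHAR(100)
AS
BEGIN
    SET NOCOUNT ON;

    -- 1. Extract all parameterized statements associated with the Import Set
    SELECT S.SQL_ID, S.SOURCE_APPLICATION, S.REGION, S.[SQL] AS SQL_TO_EXECUTE
    INTO   #STATEMENTS
    FROM   dbo.IMPORT_PAREMETRIZED_STATEMENTS AS S
           JOIN dbo.IMPORT_SOURCE_APPLICATION_REGION AS AR
             ON AR.SOURCE_APPLICATION = S.SOURCE_APPLICATION AND AR.REGION = S.REGION
           JOIN dbo.LINK_IMPORT_SETS_SOURCE_APPLICATION_REGION AS L
             ON L.SOURCE_APPLICATION_REGION_ID = AR.SOURCE_APPLICATION_REGION_ID
           JOIN dbo.IMPORT_SETS AS IMS
             ON IMS.IMPORT_SET_ID = L.IMPORT_SET_ID
    WHERE  IMS.IMPORT_SET_NAME = @IMPORT_SET_NAME;

    -- 2. Replace every generic parameter defined for the statement's APPLICATION and REGION.
    --    A loop is used because one statement usually contains several parameters.
    DECLARE @APP VARCHAR(50), @REG VARCHAR(50), @NAME VARCHAR(128), @VALUE VARCHAR(4000);
    DECLARE PARAM_CUR CURSOR LOCAL FAST_FORWARD FOR
        SELECT SOURCE_APPLICATION, REGION, PARAMETER_NAME, PARAMETER_VALUE
        FROM   dbo.IMPORT_PARAMETERS;
    OPEN PARAM_CUR;
    FETCH NEXT FROM PARAM_CUR INTO @APP, @REG, @NAME, @VALUE;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        UPDATE #STATEMENTS
        SET    SQL_TO_EXECUTE = REPLACE(SQL_TO_EXECUTE, @NAME, @VALUE)
        WHERE  SOURCE_APPLICATION = @APP AND REGION = @REG;
        FETCH NEXT FROM PARAM_CUR INTO @APP, @REG, @NAME, @VALUE;
    END;
    CLOSE PARAM_CUR;
    DEALLOCATE PARAM_CUR;

    -- 3. Insert the resolved statements into the final table (SUCCESS_SQL / FAIL_SQL omitted here)
    INSERT INTO dbo.IMPORT_FINAL_STATEMENTS (SQL_ID, SQL_TO_EXECUTE, HAS_BEEN_PROCESSED)
    SELECT SQL_ID, SQL_TO_EXECUTE, 0
    FROM   #STATEMENTS;
END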
On the SSIS Side:
1. We will create a generic IMPORT package, which will be able to run, on different threads, the import statements from the IMPORT_FINAL_STATEMENTS table
2. We will have to determine, at design time, the maximum number of threads that we intend to run in parallel without overloading either the Data Warehouse or the Source Application server. Afterwards, we can parameterize the maximum number of threads to be activated for each application
3. We will execute a loop within each sequence container (you will have one for each thread), which will:
a. Recover the next import statement to be run from IMPORT_FINAL_STATEMENTS, having HAS_BEEN_PROCESSED = 0 and THREAD_NO IS NULL, and mark the THREAD_NO column with the current thread number. Once THREAD_NO has been filled in, the other threads will automatically ignore this record.
One thing you have to be aware of is that, in order to solve the issues raised by parallel execution (ex. deadlocks occurring while updating the IMPORT_FINAL_STATEMENTS table, no matter which transaction isolation level is used), a custom SQL waiting queue should be implemented. This will ensure that a given statement is processed by the first free thread, and that you will never have a scenario in which two threads are trying to process the same row from IMPORT_FINAL_STATEMENTS at the same time.
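One simple way to implement such a queue, and the sketch I would start from (the @THREAD_NO variable is assumed to be passed in from the corresponding SSIS thread), is an atomic UPDATE using the READPAST and UPDLOCK hints, so that rows locked by one thread are simply skipped by the others:

UPDATE TOP (1) F
SET    F.THREAD_NO = @THREAD_NO
OUTPUT INSERTED.SQL_ID, INSERTED.SQL_TO_EXECUTE, INSERTED.SUCCESS_SQL, INSERTED.FAIL_SQL
FROM   dbo.IMPORT_FINAL_STATEMENTS AS F WITH (ROWLOCK, READPAST, UPDLOCK)
WHERE  F.HAS_BEEN_PROCESSED = 0
  AND  F.THREAD_NO IS NULL;

Because the row is claimed and returned in a single statement, two threads can never pick up the same record, and the result set produced by the OUTPUT clause can be consumed directly by the Execute SQL Task of the corresponding thread.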
b. Execute the content of the SQL_TO_EXECUTE field for the current import statement
i. Based on the result of the execution of the SQL_TO_EXECUTE field, proceed with the execution of the SUCCESS_SQL or FAIL_SQL field
ii. Launch the RETRY mechanism in case of failure. Based on a parameterized @MAX_NO_RETRIES, each failed statement will be retried a few times before giving up. This is very useful in case of brief network interruptions, where a retry ensures recovery from such a scenario
iii. Mark the record with HAS_BEEN_PROCESSED = 1
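In the package described here this loop is orchestrated with SSIS components, but, purely to illustrate the retry and error-handling logic of steps b. i-iii, here is a T-SQL sketch of one iteration. The variable names and the default value of @MAX_NO_RETRIES are my own assumptions; the first four variables would be populated from the row claimed in step a.:

DECLARE @SQL_ID INT, @SQL_TO_EXECUTE VARCHAR(MAX), @SUCCESS_SQL VARCHAR(MAX), @FAIL_SQL VARCHAR(MAX);
DECLARE @MAX_NO_RETRIES INT = 3, @RETRY INT = 0, @DONE BIT = 0;

WHILE @DONE = 0 AND @RETRY < @MAX_NO_RETRIES
BEGIN
    BEGIN TRY
        EXEC (@SQL_TO_EXECUTE);                               -- run the generated import statement
        IF @SUCCESS_SQL IS NOT NULL EXEC (@SUCCESS_SQL);      -- ex. SWITCH the TMP table into the final table
        SET @DONE = 1;
    END TRY
    BEGIN CATCH
        SET @RETRY = @RETRY + 1;                              -- retry, which covers brief network interruptions
        IF @RETRY >= @MAX_NO_RETRIES AND @FAIL_SQL IS NOT NULL
            EXEC (@FAIL_SQL);                                 -- ex. drop the TMP table
    END CATCH;
END;

UPDATE dbo.IMPORT_FINAL_STATEMENTS
SET    HAS_BEEN_PROCESSED = 1
WHERE  SQL_ID = @SQL_ID;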
To wrap up, let's summarize the main advantages of this Hybrid Import solution:
1. A streamlined ETL architecture, since you have just one generic IMPORT package, instead of dozens (one for each Source Application and Business Domain)
2. Improved teamwork and productivity, since you can have different programmers working simultaneously on the import statements for the same application
3. No modifications are required at SSIS level when you need to import data from new Data Sources
4. Improved Error Handling, Performance and Data Concurrency, if you use something like the Partitioning Mechanism
5. Easy to create an ASP.NET Web Interface, in order to:
a. Populate all setup tables
b. Generate the mappings between Source and Destination columns
c. Reduce the time spent by programmers in order to generate import statements
And now, back to you: how do you handle the subject of Data Import in your company's ETL? What solutions have you come up with to improve Teamwork and Productivity for Data Import?