Choosing the right ETL

by Nicolae Guse 6. November 2011 19:51

Choosing your ETL solution is one of the most important decisions that you'll make, when you're starting a new Data Warehouse. Why is it so important? Because is like in the horse and carriage example that I’ve been talking in the Data Warehouse dictionary article. Where ETL is the horse and the Data Warehouse is the carriage.

By following this logic, you'll have to answer yourself: What kind of horse do I want? Well, that's an easy answer: The best one.....But what is this actually mean?

1. The Fairytale Kind: the fastest horse, who can run like the wind, no matter what distance and no matter how heavy the carriage is. And yes, of course, it's as cheap as you can afford it, typically a gift from the Fairy. And now you know why it's called the Fairytale Kind

2. The Real Life Kind: in here you'll see that you'll have much more options. But wait, having more options, isn't something good? I have a feeling that we are going to find out:

· The Pure Racer Kind: is the fastest runner you're going to find. It will integrate your data in the blink of and eye...Well, a bit more, but you get the picture. The problem with this type of horse is that is a specialized one. It will get you there very fast, but it will work only in a handful of scenarios. Which typically will mean for you that it's more likely to carry a light chariot than a heavy carriage. And, let’s not forget, racing always sound like expensive...

· The Mule Kind: but wait, this isn't quite a horse, right? You've guess it, it's half way between a donkey and a horse. And why would you want to go there? Well, first of all, because mules are very good on carrying heavy loads, over long distances and high slopes. It's definitely the slowest of all, and also the most stubborn one, when you need to add something new in the carriage, but, at the end of the day, will get you there. It's also typically the cheapest of all.

· The Workhorse Kind: is typically the one you'll want on your carriage. I'll say it's the best compromise between the Pure Racer and the Mule. It's not either the fastest or can carry the heaviest load, but will get you there in a fair amount of time, with a pretty heavy load. Also, it's the most versatile of all, and it can be modified fairly quickly to tow very different loads. And definitely what you'll want in a Data Warehouse is versatility. If you can afford the price, of course.
As you may see, none of them is as good as the Fairytale one, but unless you have a Fairy just around the corner, you'll have to settle with these ones.

Now, that's enough about horses, let’s go back to the ETL. In order to make the right choice, there are a few things that you'll have to take into account:

· Out of the box support for the Database engine used by your Data Warehouse. If it's Microsoft SQL Server, Oracle or Terradata, you really must make sure that all its critical features are very well supported. Some of these features will be supported out of the box, while for others you'll have to get support for third party vendors. Which does increase your acquisition costs.

· Evaluate the Ecosystem that's build both around your Database engine and your future ETL tool. This is exactly the kind of investigation that will help you make sure that your not entering head down in a closed system. As long as right now you have everything you need in the system is fine, but, sooner or later you'll need something which isn't provided out of the box. Which means that you have to make sure that there are enough independent vendors around your Database engine and ETL tool to offer you this support.

As an example, I’ll mention the CDC support included into Microsoft SQL Server Integration Services (SSIS). Which, starting with SQL Server 2008, does offer support for CDC but only for SQL Server. If you want support for Oracle CDC into SSIS, you have third party vendors like Attunity, which do offer such integration.

· How many distinct data sources you have right now for your Data Warehouse? What new type of data sources do you forsee that you will have? You should start with the classics, structured data types:

     - Flat text files
     - XML
     - Database access via OLEDB drivers
     - Web Services 

You should also be aware of the whole world of unstructured data out there, and make sure you also have a plan for this. As you already know, people are very addicted to new social technologies like Twitter or Facebook, and your company definitely wants to be there. Which will also present a lot of new challenges for the entire BI system, in order to be able to integrate and interpret this data. And your challenge will be VERY different from the integration of structured data.

· Flexibility is the name of the game. Beside standard scenarios, like importing a list of tables from a source server to the Data Warehouse, you'll have all sorts of complex scenarios to be handled. Most of them should be handled by using the out of the box components, so you'll have to make sure that you have proper support for components that replicate programming logic, like: IF, WHILE, FOR.

· Support for extensive scripting language is a MUST. This would allow the IT department to develop brand new functionalities which aren't provided out of the box, and aren't available from Third Party vendor or are too expensive. Also, make sure that this scripting language is widely used, (ex. C#, Visual Basic .NET), and very good debugging tools are included within the ETL tool.

· Community support matters. ALOT. You'll need it each time when you have a problem. One of the best ways of evaluating a vendor will be to see how much unofficial support is it out there. Nothing replaces people enthusiasm or ingenuity when encountering a problem. And you'll definitely want to have such people around you when you have a problem. And that's because, in real life, the official support provided by the vendor is never enough.

· Productivity and Team Work. In the real world, you'll have large projects, on which you'll have many developers working simultaneously with the tool. You must make sure that the tool is really supporting such scenarios, and that you implement a proper ETL Development strategy and Architecture.

Let’s say you want to create an Import component, which does import data from different sources to your Data Warehouse. And let’s say you use SSIS as your ETL tool. What would you do? The easiest solution would be to create a single import package per Data Source, and a Data Flow component for each table. And if you'll go on this approach you'll notice that your architecture, rather than the tool, will become the bottleneck. But more on this in a future article.

· Source Control Support. The ETL tool must offer proper integration with the most widely used Source Control Solutions on the market, in order to offer versioning control. Additionally, you should be able to visualize both graphically and in a detail report, from within the ETL tool, which are the changes which occurred between 2 package versions.

· Error Handling for both logged and Bulk logged operations. Errors are part of the life so we must make sure that the ETL tool has the necessary components to handle them. While most of the ETL tools offer support for Transactions, which can be rolled back in case of failure, frequently there is very little error handling for bulk logged operations like Bulk Insert. This would involve the implementation of a more complex mechanism, like Bulk Load and Partitioning on a destination table, but it's possible to have real error handling even for non logged operations. And we should make sure that the ETL tool is able to handle it. More details on the use of partitioning can be found in the Table Partitioning Articles series.

· How well does it handle complex scenarios, like changing the Data Warehouse Database Engine?

One of the promises of using a dedicated ETL tool is that it encapsulates the logic of the performed Data Transformations and the applied Business Logic, from the Database engine. This would allow, at least in theory, for a company to proceed which changing the Data Warehouse Database Engine (ex. from SQL Server to Oracle or TerraData), with very little changes to the ETL component. And this is one of the areas in which a dedicated ETL tool should really deliver a great Return of Investment, compared to a classical ETL approach, where your business logic is encapsulated within Stored Procedures. Since those Stored Procedures are very specific to the Database Engine, the decision of changing the Database Engine will almost certainly result in the loss of your entire ETL. And this is why choosing the right ETL from the beginning is so important for a company.

· Automation of Data Compare Mechanism. Once you make a development on the ETL, you should be able to use an integrated component, in order to validate the differences between a previous version of the table data, and the latest version of the table data. The differences reported should cover both Missing Rows and Data Inconsistency, based on the defined mappings between the 2 table versions.

· Define Complex Business Rules. The ETL tool should allow the programmers / ETL designers to define complex business rules, and provide a degree of flexibility similar to the one provided into a Stored Procedure. Many Data Warehouse implementations are using a Hybrid Approach, which combines the power of the ETL workflows with the flexibility of stored procedures in order to rapidly define new business rules. However, placing business rules into Stored Procedures has many disadvantages, and one of the most important is the loss of visibility over the applied rules.

A mature ETL tool should allow the definition of complex rules, without the need of code writing, either by Stored Procedures or custom scripting. Additionally, an ETL tool should also provide support for interaction with a specialized Business Rules Engine Tool, which offer the users the possibility of defining new business rules without any programmer interaction. But more on this you may find in the Implementing a Business Rule Engine article. 

· Data Recycling. As long as you have a running ETL system, sooner or later you will have some data which doesn't match the integration rules defined in the system, and these records will be rejected from integration. The users must have the possibility to intervene and correct these rejected records, via a easily accessed interface, until the problem is finally fixed in the Source System. After the correction, the ETL flow will be relaunched and these user corrected data will be hopefully successfully integrated

· How easy is to find on the work market people for that technology? How much time is necessary for training before becoming productive?

Finding good people it's always hard, no matter the technology. But if the technology is highly specialized and not widely used in the geographical area were you company it's located, things are even more complicated. Additionally, as a result of the specialization, the wages are typically higher than the average.

Training new people in the use of the technology is always an option, but its success depends on a lot of factors like the availability of specialized trainings and the complexity of the technology. Realistically, when you have to train people, there will always be a significant gap between the beginning of the training and the moment in which they are becoming productive. And, lets not forget, specialized training costs a lot. Therefore, choosing a widely used technology on the market it's always the better option, unless that technology really doesn't solve your needs.

· Cost is always and issue

The best ETL tool is the one that you can afford it. Various tools offer different functionalities at different price tags. But the real question here is if you really need all those additional features, and if you really can afford to pay the extra price tag. For example, if you want a mobile phone mainly to call and be called, and if you don't need all the Camera, Social Networking and Applications, shouldn’t be more wisely to buy the 100$ phone than the 500$ one?

Still, while making your choice, you should be aware of the hidden costs driven by the loss of flexibility. Make sure you have enough power in the ETL scripting language in order to cover for the loss of out of the box features offered by more expensive products. An ETL tool must be not only have a decent price at acquisition, but also keep down the costs of maintaining and developing the system, for its foreseeable usable life. If your company's business evolves beyond your chosen ETL needs, than it will be time for a change. And change is something called Evolution.

And, at the end, since we were talking of horses, don't be afraid to put more horses at the Data Warehouse carriage. If your company has different needs, which cannot be satisfied by a single ETL, you should go for the power provided by different tools. This will help you consolidate your existing investment, while also offering the right tools to tackle the new challenges. And because sometimes, teaming up the Pure Racer, the Work Horse and the Mule is the best solution.

And now, back to you. Which were the main criteria that you've used to choose the ETL tool of your company? Which were the main challenges that you've meet and how have you solved them? What would you add to the above criteria’s for choosing the right ETL? What ETL tool would you associate to the above horse types?


Data Warehouse Best Practices

Pingbacks and trackbacks (1)+

Add comment

  • Comment
  • Preview



The ideas you'll find on this site represent the opinion of their owner. They are not either about or being endorsed by a specific company. This site is a place of sharing ideas and trying to find together best practices on Data Warehouse and Business Intelligence projects.

Some of the technical approaches will be praized, some of them will be criticized, based on the concepts presented by their owners. You shoudn't feel offended if some of the ideas which either you or your company consider best practices, here are being criticized. You should also realize that what some people consider best practices on a particular system and business scenario, can be worst practices on another.  If you don't agree with a particular idea, feel free to comment it in a civilized manner.

Only by crashing different ideas and experiences we will find a better way of doing Data Warehouse and Business Intelligence projects.