Open Source Extraction, Transformation & Loading software

Extraction, Transformation & Loading : often expensive !

ETL or the art of extracting, transforming and loading data is a key process in developing and maintaining a data warehouse, the foundation for sound Performance Management. It can however be an expensive process: not only because it requires a lot of resources and state of the art hardware, but also because the well known ETL-tools from IBM, Informatica, SAS or SAP/Business Objects do not come cheap. Only companies who are ready to invest substantially in Business Intelligence & Performance Management seem to be able to afford these tools. So what about smaller businesses: is an ETL tool out of their reach and are they condemned to writing their own hand coded ETL queries (in SQL, PL SQL) to transform their data ? It is maybe a good idea to investigate if there are inexpensive alternatives that have more or less the same functionality. In doing so, Open Source software comes to mind … When can Open Source ETL be considered a viable option and is it "ready for main stage” ?

Open source software : the more affordable alternative ?

Open Source ETL-tools could be a possible, cheaper alternative to custom-build applications. The organization can save many man hours of SQL work by implementing an Open Source ETL-tool, and money by avoiding the purchase of an expensive ETL tool, with potentially too many capabilities for the problem or budget at hand.

Open source also helps driving adoption of standards. The transparency of Open Source code and the large number of testers available in a community typically forces a sound implementation of these standards. And when standards are still in a transitional phase, developing Open Source code helps ensure that those writing the standard specification are creating something that can actually be used.

Community-driven software is seen as a reaction to the big software vendors, and wants to be a valid alternative for expensive, commercial programs. In Open Source, the intellect of a group of developers is united, and every development becomes available for every user. This is, in fact, a very attractive part of Open Source software and thus also Open Source ETL-tools. In a vibrant Open Source community, new features are quickly integrated in the tool, problems are solved in a faster way, there are enough resources to test and try out the tool and because of the diversity of programmers, people can come up with inventive and sometimes revolutionary ideas.

Open source ETL, really "open” & community-driven ?

Of course, for this to work, the community needs to be big enough to have sufficient resources to do the design and programming. In the case of LINUX, this was never a problem because the developers united in a "quest” against Windows. For ETL however, there does not seem to be a common motivator, and it is not particularly a subject that keeps a lot of people awake at night. It is a specialist's world, and although some Open Source ETL-tool vendors have supporters, progress might be significantly slower than in the case of other –more successful- Open Source programs. As a result the Open Source ETL world indeed is a very small community.

The most advanced suppliers of Open Source ETL actually admit today that they are not really community-driven, and that each company works with their own developers. Hopefully, this will change in the future, because to develop the complex transformations or the connectors to different data sources, a little bit of help is more than welcome.

Open source ETL : the main players

In the Open Source ETL world, three offerings are maturing :

These vendors offer a Java and XML based ETL tool with just enough transformations to make it a viable option and all three come with a user interface.

Pentaho & Talend are the most advanced providers of Open Source ETL-tools on the market. Pentaho and Talend develop offerings beyond just ETL. Talend is a real "data integration” offering, with data quality and profiling capabilities, while Pentaho are positioning themselves as a complete business intelligence provider, including a front-end BI suite. Clover, also offers a GUI, but you have to install ECLIPS, a Java development platform, to use it. Clover is very Java centric and as a result less user friendly than Pentaho and Talend.

The real advantage of Open Source vendors lays in the fact that they can provide, for a reasonable fee, the user with support, documentation and the necessary training. Without this, one can wonder whether the use of Open Source software is an option to a company or organization.

Open source ETL, when to consider …

There are a few project characteristics where an Open Source ETL tool can be the right solution for your company.

No complex transformations

Instead of using SQL statements to transform data, an Open Source ETL tool gives the developer a standard set of functions, error handling rules and database connections. The integration of all these different components is done by the Open Source ETL tool provider. The straightforward transformations can be implemented very quickly, without the hassle of writing queries, connecting to data sources or writing your own error handling process. When there are complex transformations to make, Open Source ETL tools will often not offer out-of-the-box solutions.

Few data sources to connect to

Most reputed ETL-vendors provide an extensive amount of connections to all sorts of data sources. This is a problem with Open Source ETL tools: most of them are based on Java architecture and need JDBC to connect to a database. In the basic license, a few connections are available but when there is a need for extra connections, the customer has to pay an extra fee and/or for some platforms (like mainframe sources) nothing might be available.

Small scaled projects

Small scaled projects with one or two, very common sources and a limited number of target tables don't need an expensive and complicated ETL tool. An Open Source ETL tool can bring a cheap and easy to install solution, which even could run on a local PC.

Open source ETL, when to avoid …

The main obstacles for the use of Open Source ETL are:

Java & XML knowledge required for complex transformations.

As all the Open Source ETL tools are Java and XML based, you need at least a basic understanding and knowledge of Java and XML to get the most out of them. While a classic ETL tool is quite easy to learn and even within reach of the non-IT person, there is more IT knowledge necessary for Open Source ETL-tools. Especially when you need more complex transformations, Java or XML knowledge will need to be available in your company. Most of the vendors offer a basic set of transformations, but in a lot of cases more advanced rules are necessary to get your data in the right shape. This is a major disadvantage for the Open Source ETL- tools: the user has to develop the transformation rules, and therefore needs an advanced knowledge of Java. Not all of the smaller companies have such resources at hand.

Connectivity to multiple & less common data sources required.

The classic ETL requirement is to extract data from common databases like Oracle and DB2, connect to ERP data sources (SAP) or treat data from mainframe systems. Commercial ETL vendors spend a considerable amount of time and effort just on developing connectivity functionality to all possible sources. Every new version of each database and ERP system needs extra development and certification of the connectivity. The classic tools additionally offer fast native database connections, or parallel database connections, and can handle huge amounts of data in a short time. Open Source ETL-tools are Java based and need a JDBC driver to connect to a data source. Although a basic set of connections is delivered by most of the Open Source ETL-tool vendors, no connector to the classic databases or enterprise applications comes free of charge, if available at all. Speed & performance also is an issue: all classic ETL tools will outperform the Open Source ETL-tool, as specific performance tuning has happened and more real-live customer environments are available to discover problems & optimize the solutions. Some Open Source tools offer cluster distribution support, but this is very basic in comparison to the real parallel architecture of solutions like IBM DataStage or Informatica.

Lack of skills, knowledge & resources.

There are plenty of consultancy companies that offer a wide range of resources for the traditional ETL tools. When a company starts up an ETL project and internal resources are scarce or an initial coaching model is chosen, skills will be available in the market place. Almost no such skills pool really exists for Open Source ETL-tools at this point. Although Pentaho and Talend provide support, documentation and training, they don’t have consultants that can go to customer sites and a partner network still needs to be developed. This can make the start up and the first programming steps of a project a difficult exercise, and users can quickly lose faith in the tool as a result. Finally, the lack of knowledge in the marketplace might quickly make you quite dependant on the few available resources.

Conclusion

Open Source ETL tools are getting better, and a few of them are starting to get closer in functionality and user friendliness to the big enterprise tools. If the Open Source ETL community would start working around certain critical, currently incomplete features like complex transformations and data source connectivity, the solutions might become a force to reckon with.

At the moment, Open Source ETL-tools can be a valid alternative for custom code (SQL, PL SQL) ETL applications for smaller organizations. Custom code anyway should be avoided as it quickly becomes a nightmare to maintain, and it gives a lot of work to the developer. If your company has good Java and XML resources, doesn't need a complex ETL process and is prepared to contribute a bit of time and effort in the Open Source community, an Open Source ETL tool might be an option. Consider you will be a forerunner and be prepared for extra investments in skills build-up, lack of knowledge in the marketplace and potential support issues like bugs.

For complex enterprise ETL processes on the other hand, current Open Source ETL-tools are not ready "for main stage”. In this kind of environment, the powerful, commercial ETL tools are still the way to go.

Open Source Extraction, Transformation & Loading software: ready for mainstage ?