Modules:IDL ETL Tool Selection
Contents |
Introduction
This document describes the process of selecting the ETL tool used to build and execute the processes for the Initial Data Load project. The goal of this document is to keep track of the decision process, and to give to the community feedback in order to facilitate feedback.
Selection process
The goal of this selection process is to have a short list of ETL tools with an small prototype of data load built. After this goal is achieved, the final decision will be done based on the experiences obtained during the evaluation of the tools and the development of prototypes.
The final result of the decission process is the selection of the ETL tool used to build and execute the processes needed for the initial data load of Openbravo ERP, a prototype of the Initial Data Load project based on the tool selected, gain experience on the tool selected and in ETL tools in general.
The selection process will consist in the following steps.
- Indentify an initial list of ETL tools suitable to be elegible for the Initial Data Load Project.
- Create a selection criteria list requiered for an ETL tool to be elegible for the Initial Data Load project.
- Create a benchmark of the ETL tools identified with pros and cons, and select a short list of ETL tools from the initial list based on this selection criteria benchmark. This short list will be used to develop the prototype.
- Define a prototype scenario and goals to be implemented for each ETL tool selected in the previous step.
- Make a final decission based on the work done in the previous steps.
List of tools
- Apatar
- Clover.ETL
- Enhydra Octopus
- Jitterbit 2.0
- Pentaho Data Integration (Kettle)
- Scriptella ETL
- Talend Open Studio
- KETL
- Palo ETL Server
Selection criteria
This section lists the selection criteria which play a role in determining the list of ETL tools which we will prototype:
- Open source project with a suitable licensing model for Openbravo.
- Active user base and development team, active forums, reactive developers, regular releases of new versions, professional software development methodology (test-driven, attention for backward compatibility)
- Attention for documentation.
- Mature and complete steps set:
- Extraction and load steps: CSV files, XML files, Excel spreadsheets, Database tables, Web services, ..
- Transformation steps: Field mapping, lookups, joins, merges ...
- Good management of fields type and format.
- Good management of parameters and attributes of each process.
- Graphical interface tool to develop, test, run, deploy process, modify parameters.
- Logging / Debugging. Ability to identify the step and the records that failed / succeeded to load.
- Testable by automatic test procedures and test data.
- Rerunability, recoverability. Ability to rerun a failed process execution and to roll-back to the previous state of a failed process execution.
- Customizable and extendible by consultants: ability to extend developed process to include new data to load and new logic to execute.
- Deployment and execution options. Availability of server side execution, schedule execution, monitoring tools, alerting.
- Weight of the distributable runtime of the ETL tool. The less the better.
Tools evaluation (Adrián)
This is an small evaluation of the list of ETL tools identified. In my opinion the end the decision will be between Pentaho data integration and Talend Open Studio.
Pentaho data integration (Kettle)
Popularity: Google results > 64600, >10 forum messages a day.
- (+) Already experience in Openbravo. BI integration and POS-ERP integration.
- (+) Complies with all requirements. User interface, deployment tools, steps set ...
- (-) Ugly graphic designer but functional.
- (-) Problems when fine adjusting processes. When developing transformations what it takes more time is trying to adjust small stupid things.
Talend Open Studio
Popularity: Google results > 832000, >15 forum messages a day.
- (+) Complies with all requirements. User interface, deployment tools, steps set ...
- (+) Code generating approach. The process designer genereates java or perl code.
- (+) Nice graphic designer, based on Eclipse.
- ( ) Small experience in Openbravo. Jaime Torres and Juan Pablo Aroztegui.
- (-) Graphic designer learning curve bigger than Pentaho.
Jitterbit 2.0
Popularity: Google results > 44100, ~2 forum messages a day.
- (+) Nice graphic designer
- (-) Server scheduler and logs not available in community version.
Apatar
Popularity: Google results > 45300, ~2 forum messages a day.
- (+) Nice graphic designer
- (--) Reduced set of transformation steps and reduced functionality.
Clover.ETL
Popularity: Goggle results > 25200, ~2 forum messages a day.
- (+) Nice graphic designer
- (--) Designer and server are not open source, are commercial tools. Only the engine is open source.
Palo BI Suite
Popularity: Google results > 30500, ~ 2 forum messages a day.
Enhydra Octopus
Popularity: Google results > 13700, No official forums found.
Scriptelle
Popularity: Google results > 5540, Forums seem dead.
KETL
Popularity: Google results > 729, Forums seem dead.
Prototype
This section describes the capabilities and goals to achieve in the prototype built for the tools selected.
Scenario
- Load of CSV / XML file with > 1000 rows and with columns of different types and formats.
- Save to different database engines: Oracle and Postgre
- Save different field types: Strings, integers, decimals, dates, booleans, ...
- Lookup of fields in the ERP database or CSV /XML files.
- Complex fields transformations: concatenations, based on conditions, based on function of other fields, UUID creation.
- Complex row mappings: Save an input row to different tables, save different input rows to the same table row
- Use of process parameters
- Execution of ETL processes: results, logging, debugging,...
Goals
- Validate the functionality provided by the ETL tool
- Validate the usability of the ETL tool: Ease of use, documentation, support.
- Validate the coverage of the functionality needed by the Initial Data Load Process
- Load of CSV / XML files
- Save to database
- Transformation steps
- Parameter management
- Validate the technology
- Get a more detailed information for the estimation of the project, tasks, people needed and skills needed.