Sunday, May 9, 2010

ETL Scripting with Open Source ETL Tools

2 comments

Extract Transform Load (ETL)
Extracting, transforming, and loading (ETL) is a process used in database especially in data warehousing and have followin steps
Extracting data from sources (The transactinal Database)
Transforming it to fit operational needs (which can include quality levels)
Loading in destination (database or data warehouse or staging where you practice about data)

Extraction:
The first part of an ETL process involves extracting the data from the source systems usually called a transactional database where actual transactions are perforemd. Most data warehousing projects consolidate data from different source systems like Relational Databases.
Extraction converts the data into a format for transformation processing. An intrinsic part of the extraction involves the parsing of extracted data, resulting in a check if the data meets an expected pattern or structure. If not, the data may be rejected entirely or in part. This step invloves extracting all the data as mentioned in the written SQL queries and procedure from the source system.

Transformation:
The transform stage applies a series of rules or functions to the extracted data from the source to derive the data for loading into the end target. Some data sources will require very little or even no manipulation of data 

Loading
This step involves process lodaing data which has extracted from one source database say MYSQL to load in SQLSERVER The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative, frequently updating extract data is done on daily, weekly or monthly. while other DW (or even other parts of the same DW) may add new data in a historicized form, for example, hourly
  
Example:
It might be your requirement that you need to accumulate data that is residing in different database like wise Customer Sales information on MYUSQL  and its support and maintenance information on SQL SERVER so here you need to gather the information in order to present it the ETL script is the best solution for you.

BEST Practices:
Four-layered approach for ETL architecture design 
  • Functional layer: Core functional ETL processing (extract, transform, and load).
  • Operational management layer: Job-stream definition and management, parameters, scheduling, monitoring, communication and alerting.
  • Audit, balance and control (ABC) layer: Job-execution statistics, balancing and controls, rejects- and error-handling, codes management.
  • Utility layer: Common components supporting all other layers.
Open Source ETL Tools:
  • Apatar
  • CloverETL
  • Flat File Checker
  • Jitterbit 2.0
  • Pentaho Data Integration (now[update] included in OpenOffice Base)
  • RapidMiner
  • Scriptella
  • Talend Open Studio

2 comments: