250 likes | 421 Views
A Comparative Study between ETL and E-LT approaches for loading data into a Data Warehouse. Vikas Ranjan CSCI 693. Agenda. Introduction Extract, Transform and Load (ETL) Approach Strengths/ Weaknesses of ETL Extract, Load and Transform (E-LT) Approach Strengths/ Weaknesses of E-LT
E N D
A Comparative Study between ETL and E-LT approaches for loading data into a Data Warehouse Vikas Ranjan CSCI 693
Agenda • Introduction • Extract, Transform and Load (ETL) Approach • Strengths/ Weaknesses of ETL • Extract, Load and Transform (E-LT) Approach • Strengths/ Weaknesses of E-LT • Experiments • Results • Conclusion/ Future Work • References • Q & A
Introduction • All Business Intelligence applications are data-centric. • Large volumes of data is stored and processed along with its history in the Data Warehouse. • The Data is extracted from various heterogeneous source systems and transformed as per the business requirements. • Data is used for analytical purposes- future forecasting, profitability analysis, trend analysis etc. to drive the business.
Extract, Transform and Load (ETL) • Traditional approach of loading data into data warehouses. • Data is first pulled/ pushed from various heterogeneous sources like ERP, CRM, RDBMS, Flat Files. • Business rules are applied on the data in the Staging Area. • Transformed data is loaded into target database. • Often designed backward, thus only the relevant data is fetched.
Strengths/ Weaknesses of ETL Strengths • Can perform complex operations in single data flow diagrams. • Mostly designed backward, thus only relevant data is loaded. • Used for building real-time data warehouses. • Robust tools are available like Informatica, Ab Initio, Data Stage . Weaknesses • Data transformation step of ELT is performed by ETL engine, therefore increasing processing time. • Data is moved over the network twice. • Since it is developed backward, more effort for future redesign.
Extract, Load and Transform (E-LT) • Newer Approach of loading into target data warehouse. • The Data is extracted from various sources same way it is done via ETL. • This extracted data is loaded directly into target data warehouse. • The transformations and complex business rules are applied by native SQL drivers. • The processing is done by database engine rather than ETL engine.
Strengths/ Weaknesses of E-LT Strengths • Since all the data is available, the future changes can be easily incorporated. • Once the data is loaded on the target platform, all transformations are placed on the RDBMS engine. This reduces network congestion. • Provides optimal performance as no extra hardware needed. Weaknesses • E-LT suffers from a limited availability of tools like Informatica Pushdown, Data Integrator etc. • Does not work well for complex business cases. • E-LT cannot be used to design near real-time enterprise data warehouse.
Tools and Data Set Software / Hardware • OS Platform: Sun M9000 Server (www.sun.com). • Relational Database Management System: Teradata V2R6.2.1 (www.teradata.com). • ETL/E-LT Tool: Informatica PC 8.6.1 Hot Fix 9 Advanced Edition (www.informatica.com). Data Set • All the data used is experiments is obtained from the test database a telecom company. • All the experiments were conducted using Informatica as ETL and E-LT tool.
Experiment 1 ETL vs. E-LT (Full Pushdown): Informatica has recently introduced both ETL and E-LT capabilities in its Power Center tool Job Name: Tax _Write_Off Job • Same Job was developed using ETL approach as well as E-LT approach. • Informatica pushed down all the code processing to the RDBMS engine.
Experiment1 contd. ETL Job: Informatica Server handled all the code processing and generated its own internal SQL
Experiment 1 contd. E-LT Job ( Full Pushdown): In the full pushdown, Informatica Server pushed all the code to RDBMS engine and Informatica Server worked purely as an E-LT tool
Experiment 1 Results Approximately Five Times more performance gain using E-LT Full Pushdown
Experiment 2 ETL vs. E-LT (Target Pushdown): • Job Name: Tax_Extract_ETL_tst Job • Same job was developed using ETL approach as well as E-LT approach with Target Pushdown only • The Pushdown to database engine happened at Target database only
Experiment 2 contd. ETL Job: Informatica Server handled all the code processing and generated its own internal SQL
Experiment 2 contd. E-LT Job (Target Pushdown):Informatica Server pushed code processing on target database to RDBMS engine
Experiment 2 Results No Performance Differences since the Pushdown was on Target Database only
Experiment 3 ETL vs. E-LT (Source Pushdown): • Job Name: Tax_Extract_ETL Job • Same job was developed using ETL approach as well as E-LT approach with Source Pushdown only • The Pushdown to database engine happened at Source database only
Experiment 3 contd. ETL Job:Informatica Server handled all the code processing and generated its own internal SQL
Experiment 3 contd. E-LT Job (Source Pushdown): Informatica Server pushed down code processing on source side to RDBMS engine
Experiments 3 Results • No Performance Differences as Pushdown was on Source Database only • Source Pushdown E-LT does not work with Teradata Sequence Generator
Results • Significant performance gains were obtained using full pushdown E-LT over ETL approach. • No performance gains using both source and target pushdown of E-LT approach and ETL approach as both used ETL Server and Database Engine resources( memory and CPU). • Code changes were needed to redesign existing ETL jobs to use E-LT power of database engine. • E-LT does not work in building Real Time Data warehouses
Conclusion and Future Work • ETL works well for very complex transformations and active data warehouses • E-LT works well for small and medium-sized data marts and when the source and target are on the same database platform only • Future: Building data warehouse solutions using hybrid approach (combination of ETL and E-LT processes) .
References [1] A. Simitsis, P. Vassiliadis, T. Sellis “Optimizing ETL Processes in Data Warehouses,” in Proc. 21st International Conference on Data Engineering, 2005, (ICDE 2005), pp. 564-575. [2] G.X. Zhou, Q.S. Xie, Y. Hu, “E-LT Integration to Heterogeneous Data Information for SMEs Networking based on E-HUB,” in Proc. Fourth International Conference on Natural Computation, 2008, IEEE, pp. 212-216. [3] I. William, S. Derek, and N. Genia, DW 2.0: The Architecture for the Next Generation of Data Warehousing. Burlington, MA: Morgan Kaufman, 2008, pp. 215-229. [4] R. J. Davenport, September 2007. [Online] ETL vs. ELT: A Subjective View. InSource IT Consulting Ltd., U.K. Available at: http://www.insource.co.uk/pdf/ETL_ELT.pdf [5] L. Troy, C. Pydimukkala, How to Use PowerCenter with Teradata to Load andUnload Data, Informatica Corporation [Online], Available at: www.myinformatica.com .