240 likes | 619 Views
What is ETL?. ETL: Extract Transformation and Load The term is used to describe data migration or the data conversion process ETL may be part of the regularly repeated business process
E N D
What is ETL? • ETL: Extract Transformation and Load • The term is used to describe data migration or the data conversion process • ETL may be part of the regularly repeated business process • The amount and complexity of data grew dramatically over the past decade so the ETL processes became more complex and demanding
Phases of ETL Project • Requirements • Analysis • Design • Proof of Concept • Development • Testing • Execution • Verification
Requirements • Scope of the data migration - what data is required in the target system and how is it going to be used? • Execution requirements – has to be within certain timeframe, sequence, geographic location, repeatability, acceptable system down time, etc. • Source data retention period, backup and restore requirements • Requirements should be made with this in mind: • Data is the company’s most valuable asset. • Consequences of corrupt data are usually very costly.
Analysis • Understanding the source data • Data Dictionary that is used for designing ETL process has to be created • Mission critical task • Frequently underestimated (importance and time) • All available resource should be used to do analysis properly: • Available system documentation including Data Model and Data Dictionary • People • Reverse engineering
Design • Choice of methodology • Choice of technology • Design Target Database • Design ETL process • Data Mapping Document • Map source data to the target database • Specifies transformation rules • Specifies generated data (not from source) • Design ETL verification process • Ensures that all requirements are addressed
Proof of Concept • Helps to determine or estimate: • Feasibility of the concept • Development time • Performance, capacity and execution time • Requirements to be met • Gain knowledge about the technology • Code produced in this phase usually can be re-used during the development phase
Development • Includes: • Produce code and processes as per Design and Data Mapping Document • Data verification scripts or programs as per Test Plan • Execution scripts as per Execution Plan • Unit testing – performed and documented by developers. • Typical Challenges: • Inadequate requirements and design documents • Developers unfamiliar with technology
Testing • Ensures that requirements are met • Test Plan is highly recommended • Types of testing: • Functional, stress, load, integration, connectivity, regression • Challenges: • Automation and repeatability (testing and verification scripts) • Creation of the Test Data • Extracting small data sets from large data volumes • Confidential data may not be made available for testing
Execution • Execution plan should include: • Sequence of tasks • Time of execution and expected duration of execution • Checkpoints and success criteria • Back out plan and continuation of business • Resources involved • For mission, critical system down time could be limited or even entirely unacceptable • Execution should be controlled and verified
Verification • Confirms that the data migration was successful • Determined during the design phase • Various methodologies and technologies could be used • Automated verifications are highly recommended (driven by requirements)
Why ETL Projects Fail? • Underestimate complexity of the project • Overlook or neglect phases of the project • Wrong choice of technology • Common misconceptions like • Expensive ETL tools will solve all problems • No or very little programming will be required • We don't need or we don't have time for plans, but we know exactly what we need to do
Problems with Mainstream ETL Tools • Maintaining license and consultants is very expensive • Significant time required to learn • Usually require dedicated hardware • Cannot take advantage of the database vendor proprietary technologies that are optimized for the fastest data migration • Complex tasks very often require integration with other technologies • Very limited performance • Only small amount of provided functionality is actually required for ETL project • Very limited application for Data Analysis • Huge discrepancy between marketing promises and actual performance
Malloc Inc. • During the 15 years in the IT Consulting business, a proprietary ETL methodology and technology was developed • Consists of two major modules: • Database Analyzer • G-DAO Framework • Major advantages: • Inexpensive, easier to learn and performs better than mainstream ETL Software • Any Java developer can master it and start using it within several days • It is proven and it works
Database Analyzer • Produces ETL Data Analysis Reports in various formats • Major usage: • To analyze and understand source data and the database attributes • Create data mapping and transformation documents • Create a data dictionary • Suggests ways to improve database design • Valuable source of information for Business Analysis, Data Architects, Developers, and Database Administrators.
Database Analyzer Reports • Intuitive, descriptive and easy to read • In HTML format • Can be imported and edited in major document editors such as MS Word
Graphical User Interface • User friendly GUI Interface • It can also run in Batch mode for lengthy analysis (large data sources)
G-DAO Framework • Java Code Generator • Eliminates huge legwork to develop the code required for the ETL Process • Uses analysis performed by Database Analyzer to produce the code optimized for a particular database • Code may be used for purposes other than ETL (any kind of database access and data manipulation) • Uses the advantage of an almost unlimited world of java libraries (no proprietary language and interfaces)
G-DAO Framework Advantages • Data can flow directly from source to target (no need for intermittent storage into files) • XA transactions are supported for all major databases • Functionality is limited only by the limitation of the JDBC driver and Java language • Easy to learn and implement • No dedicated hardware required • Provides a platform for any kind of business application that requires data access
Malloc Inc • Incorporated in 1993 in Toronto, Canada • Provided IT Consulting to: • Oracle Corporation • General Electric • Citibank • Royal Bank • Bank of Montreal • The Prudential • Standard Life • and many more • Most recent implementations of Database Analyzer and G-DAO Framework • Citibank • Royal Bank of Canada • http://www.mallocinc.com