310 likes | 316 Views
Explore ETL (Extract, Transform, Load) tools vs. handcrafted code techniques for data quality, with insights, examples, advantages, and PL/SQL tips.
E N D
Do You Need an ETL Tool? Ben Bor NZ Ministry of Health
Ben Bor • Over 20 years in IT, most of it in Information Management • Oracle specialist since version 5 • Involved in Business Intelligence for over 10 years • Consulted the world’s largest corporations • Presents regularly on Information Management • Was annual Guest Lecturer at Sussex University
Contents • What is ETL • ETL tools vs. ‘handcraft’ code • PL/SQL techniques
What is ETL ETL = Extract, Transform and Load: • Any source, target ; • Built-in complex transformations • Point-to-point vs. hub-and-spoke
Our Own ETL Requirements Data Quality Flat Files SQL Loader PL/SQL PL/SQL
Tools or Handcraft? ETL Advantages: Graphic User Interface Automatic documentation Off-the-shelf set of ready-to-use transformations Built-in scheduler Database Agnostic Handcrafting Advantages: No limitation reuse existing code & non ETL No specific methodology No license cost No impact on infrastructure Transportable Release & Code-Management by script
Oracle ETL Facilities • External Tables • Merge • SQL Loader • PL/SQL • Database links
Why Use PL/SQL • Integrated environment (no installation required) • Available resources • Reuse code ‘snippets’ • Good performance • Integration with and control of the database
PL/SQL Tips and Techniques Quality Techniques Tricks
What is Quality? [1] “Totality of characteristics of an entity that bears on its ability to satisfy stated and implied needs.“ [The ISO 8204 definition for quality]
Quality 2 [2] Quality is a collection of “ilities”: • Reliability - operate error free • Modifiability - have enhancement changes made easily • Understandability - understand the software readily • Efficiency - the speed of the software • Usability - use the software easily • Testability - construct and execute test cases easily • Portability - transport the software easily
Quality 3 [3] “All the things you do today in your software development, in order to bear fruit in the future.”
Standards & Conventions Use meaningful names V_Number_Of_Items_In_Array vs. i or no_itms Distinguish between types: V_ Variable a_ Parameter C_ Constant G_ Global constant
Using Packages • Central package with utilities and all output • All error messages and numbers • All common constants (date format etc’) • Global variables • Statistics data • Other packages encapsulate related logic • Within package: • Procedures & functions have: • Meaningful name • A99_ prefix. A is the level (A highest). 99 unique ID
Example: procedure and variable naming XXX_Write_Flat_File.U03_Write_Record_To_CSV( a_File_Handle, C_Field_Delim, C_Field_Separ, C_Record_Separ, RM_REFERENCE_rec.REFTYPE, RM_REFERENCE_rec.CODE, RM_REFERENCE_rec.DESCRIPTION, To_Char(RM_REFERENCE_rec.ISDEFAULT ,'9') );
Techniques Error logging Autonomous Transaction Run statistics Release mechanism Overloading
Error Logging Technique Global variables keep key information: • Record ID • Run ID • Location in code Local error trapping decides severity and error code. All error trapping passed up.
Error Logging Structure TABLE ERROR_LOG( ERR_TIME DATE, ERR_NUM INTEGER, SOURCE_URN VARCHAR2(20), SOURCE_SYSTEM_ID VARCHAR2(5), PLACE_IN_CODE VARCHAR2(64), ERR_LOCATION VARCHAR2(255), ERR_DESCRIPTION VARCHAR2(512), SEVERITY NUMBER(6) ) ERR_TIME 18-OCT-02 10:04:52 ERR_NUM 1001 SOURCE_URN 223010913 SOURCE_SYSTEM CRS PLACE_IN_CODE In FLIP_PKG B06 ; 6(utils A08) ERR_LOCATION A08_Lookup_Type ERR_DESCRIPTION No match found for [Plan_Code] value [C3] SEVERITY 10
Autonomous Transaction -- =================== PROCEDURE E00_write_error_log( -- =================== a_err_num INinteger, a_Severity INInteger, a_err_location INVarChar, a_err_description INVarChar) IS PRAGMA AUTONOMOUS_TRANSACTION; V_Place_In_Code DW_Process.Error_Log.Place_In_Code%Type; BEGIN V_Place_In_Code := G_Place_In_Code || '(utils ' || G_Place_In_UTILS_Code || ')'; INSERTINTO DW_Process.Error_Log ( err_time, err_num, Severity, BOROUGH_ID, SOURCE_URN, SOURCE_SYSTEM_ID, Place_In_Code, err_location, err_description ) VALUES ( sysdate, a_err_num, a_Severity, G_BOROUGH_ID, G_SOURCE_URN, G_SOURCE_SYSTEM_ID, V_Place_In_Code, a_err_location, a_err_description ); COMMIT;-- commit the autonomous transaction, outside transaction is unaffected. G_Stats_Rec.TOTAL_NO_OF_ERRORS := G_Stats_Rec.TOTAL_NO_OF_ERRORS +1; -- =================== END E00_Write_Error_Log ; -- ===================
Run Statistics • G_Stats_Rec is a record with all the statistics fields • Defined in the central package (therefore resident in memory) • It is updated by the writing procedures (all central) • It is written out at the end of the run
Release Mechanism • Table of ‘release notes’ • Each package has C_Version constant updated each release • ‘Show_Version’ scripts display versions and notes • Results shipped with each release
Remove Spaces -- =================== FUNCTION A04_Remove_Spaces( -- =================== a_Instring INVarchar) ReturnVarchar IS /* ** Removes all the spaces from a string, leaving the rest of the printable characters */ BEGIN G_place_in_UTILS_code :='A04'; -- For use by the error trapping routine RETURN TRANSLATE( a_Instring, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890’ || '\|,<.>/?#~@;:[{]}=+-_`¬!"£$%^&*() ', 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890’ || '\|,<.>/?#~@;:[{]}=+-_`¬!"£$%^&*()'); -- =================== END A04_Remove_Spaces ; -- ===================
Strip Leading non-numerics -- ============================ FUNCTION F09_Strip_Leading_non_digits( -- ============================ a_String INVARCHAR2) RETURNVARCHAR2 IS /* ** Remove leading non-digits from the input. ** Example: Input string: 'abcde12345edcba' ** Output string: '12345edcba' */ v_string Varchar2(4000); v_first_digit_pos Integer; BEGIN -- Replace all digits by 0 v_string := Translate(a_String,'1234567890','0000000000'); v_first_digit_pos := instr(v_string,'0'); RETURN F01_Right(a_String, v_first_digit_pos ); -- ============================ END F09_Strip_Leading_non_digits; -- ============================
Overloading -- ======================= PROCEDURE U03_Write_Record_To_CSV( -- ======================= a_File_Handle INutl_file.file_type, a_Field_Delim INVarChar,-- the quotes, for CSV a_Field_Separ INVarChar,-- the comma , for CSV a_Record_Separ INVarChar,-- the Carriage Return + Line feed , for CSV a_String1 INVarChar:= G_default_Value , a_String2 INVarChar:= G_default_Value , a_String3 INVarChar:= G_default_Value , . . . ) IS BEGIN IF a_String1 = G_default_Value THENGOTO End_Of_Record ;ENDIF; U02_Write(a_File_Handle, a_Field_Delim || a_String1 || a_Field_Delim); IF a_String2 = G_default_Value THENGOTO End_Of_Record ;ENDIF; U02_Write(a_File_Handle, a_Field_Separ || a_Field_Delim || a_String2 || a_Field_Delim ); IF a_String3 = G_default_Value THENGOTO End_Of_Record ;ENDIF; U02_Write(a_File_Handle, a_Field_Separ || a_Field_Delim || a_String3 || a_Field_Delim ); . . . <<End_Of_Record>> U01_Write_Line(a_File_Handle, a_Record_Separ); -- ======================= END U03_Write_Record_To_CSV ;--------------------------------------------------------------------------------------------------------------------------------------------------------------- -- =======================
Summary ETL or PL/SQL? Your choice. • Consider: • Overall cost • ‘Politics’ • Convenience • Portability • Speed of development • Reusability IF PL/SQL : ensure Quality
Thank you ! I can be contacted at ben_bor@moh.govt.nz