410 likes | 548 Views
MIS 4346/5346 Data warehousing. Extract, Transform, Load. Agenda. Review Analysis (Bus Matrix, Info Package) Logical Design (Dimensional Modeling) Physical Design (Spreadsheet) Implementation (Data Mart Relational Tables) ETL Process Overview ETL Components Staging Area Extraction
E N D
MIS 4346/5346 Data warehousing Extract, Transform, Load
Agenda • Review • Analysis (Bus Matrix, Info Package) • Logical Design (Dimensional Modeling) • Physical Design (Spreadsheet) • Implementation (Data Mart Relational Tables) • ETL Process Overview • ETL Components • Staging Area • Extraction • Transformation • Loading • Documenting High-Level ETL Requirements • Documenting Detailed ETL Flows • Example ETL
Review: DM Implementation DimStudent FactEnrollment CREATE TABLE DimStudent( student_skintidentity(1,1), student_idvarchar(9), firstnamevarchar(30), lastnamevarchar(30), city varchar(20), state varchar(2), major varchar(6), classification varchar(25), gpa numeric(3, 2), club_namevarchar(25), undergrad_schoolvarchar(25), gmatint, undergrad_or_gradvarchar(10), CONSTRAINT dimstudent_pk PRIMARY KEY (student_sk)); GO CREATE TABLE FactEnrollment( student_skint, class_skint, date_skint, professor_skint, course_grade numeric(2, 1), CONSTRAINT factenrollment_pk PRIMARY KEY (student_sk, class_sk, date_sk, professor_sk), CONSTRAINT factenrollment_student_fk FOREIGN KEY (student_sk) REFERENCES dimstudent(student_sk), CONSTRAINT factenrollment_class_fk FOREIGN KEY(class_sk) REFERENCES dimclass (class_sk), CONSTRAINT factenrollment_date_fk FOREIGN KEY(date_sk) REFERENCES dimtime (date_sk), CONSTRAINT factenrollment_professor_fk FOREIGN KEY(professor_sk) REFERENCES dimprofessor (professor_sk)); GO
ETL Overview • Reshaping relevant data from source systems into useful information stored in the DW • Extract • Copying and integrating data from OLTP and other data sources in preparation for cleansing and loading into the DW • Transform • Cleaning and converting data to prepare it for loading into the DW • Load • Putting cleansed and converted data into the DW
ETL Process • Not Really New, BUT… • Much more data • Includes rearranging, summarizing • Data used for strategic decision-making • Characteristics: • Process AND technology • Detailed, highly-dependent tasks • Consumes average 75% of DW development • An on-going process for life of DW • Requirements: • Well-documented • Automated • Flexible
ETL Process • Determine target data • Determine data sources • Prepare data mapping • Organize data staging area • Establish data extraction rules • Establish data transformation rules • Plan aggregate tables • Establish data load procedures • Load dimension tables • Load fact tables
ETL Process Flow 3, Spreadsheet 1, Dim Model 2, Spreadsheet 5, SSIS 8, 9, 10, SSIS 6, 7, Map & SSIS 4
ETL Staging Area • Information hub,facilitating the enriching stages that data goes through to populate a DW • Advantages: • Separates source systems and DW • Minimizes ETL impact on source AND DW systems • Can consist of multiple “hubs” • “upload” area • “staging” area • “DW load images”
High Level Design of ETL Process • Initial documentation of: • What data do we need and where is it coming from? • Physical DW Design Spreadsheet shown previously • What are the major transformation/cleansing needs? • “Extend” Physical DW Design Spreadsheet OR • ETL Map • What’s the sequence of activities for ETL? • ETL Map
Common Transformations • Format Revisions • Key Restructuring, Lookup • Handling of Null Values • Decoding fields • Calculated, Derived values • Merging of Data
Common Transformations, cont… • Splitting of single fields • Character set conversion • Units of measurement conversion • Date/time conversion • Summarization • Deduplication
Common Transformations, cont… • Other Data Quality Issues • Standardize values • Validate values • Identifying mismatches, misspellings • Etc… • Suggestions: • Appoint “Data Stewards” • Ensure ETL programs have control checks • Data Profiling…
Data Profiling • Systematic analysis of the content of a data source • Goals: • Anticipate potential data quality issues upfront • Build quality corrections and controls into ETL process • Manual and/or Tool-assisted
Documenting ETL High Level Design • Add to existing DW Physical Design Spreadsheet
Low Level Design of ETL Process • Detailed documentation of: • What data do we need and where is it coming from? • What are the major transformation/cleansing needs? • What’s the sequence of activities for ETL? • Can use tool like SSIS
Extracting Source Data • Two forms: • Static Data Capture • Point-in-time snapshot • Initial Loads and periodic refreshes • Revised Data Capture • Only data that has been added, updated, deleted since last load • Ongoing incremental loads • Two timeframes • Immediate • Deferred
Static Data Capture • (T)SQL Scripts • e.g., small number of tables/rows • Export/Import Tables • e.g., database or non-database sources • Backup/Restore Database • e.g., copying sqlserver source database for initial load ETL • Detach/Attach Database • e.g., copying older sqlserver version to newer sqlserver version for initial load ETL
Revised Data Capture • Immediate / Real-time • ETL side: procs get changed data from log real-time and update ETL staging tables • OLTP side: triggers update ETL staging tables • OLTP side: apps write to OLTP AND ETL staging tables • Deferred • ETL side: procs get changed data from OLTP tables based on timestamps • ETL side: procs do file comparison • OLTP side: changed data capture (SS 2008)
Documenting ETL Low Level Design: SSIS • Comes with SQL Server • Helps document and automate ETL process • Based on defining • Packages • Tasks • One approach • A package for each target table • A "master" package
SSIS Package Examples: Extract Changed using CDC Eg, SELECT * from cdc-customer WHERE cdc_chg_date > etl_last_capture_date;
Class Performance DW Example • Create ClassPerformanceDW database • Using ClassPerformanceDW database… • Create ClassPerformanceDW tables using SQL Script • http://business.baylor.edu/gina_green/teaching/sqlserver/scripts/generate_class_performance_dw_tables/create_class_performance_dw_tables.sql
ETL Example using SQL Scripts • One "Master Script" • Calls five "table" scripts
"Master" Script --be sure to turn on Query, SQLCMD mode in order to run this script Use ClassPerformanceDW print 'loading dimclass table' Go :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimclass.sql" print 'loading dimprofessor table' Go :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimprofessor.sql" print 'loading dimstudent table' Go :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimstudent.sql" print 'loading dimtime table' Go :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimtime.sql" print 'loading factenrollment table' Go :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_factenrollment.sql" Print 'class performance DW data transformation and loading is complete' Go
Load "DimProfessor" Script (pg. 1 of 3) set nocount on print 'remove existing data from dimprofessor' delete from dimprofessor; go print 'reseeding SK identity value back to 1' dbcccheckident ('dimprofessor', reseed, 0); go print 'adding oltp prof data to dimprofessor' print 'professor_sk will be automatically inserted' insert into dimprofessor ( professor_id, firstname, lastname, rank, department) select prof_id, firstname, lastname, rank, dept from regnOLTP.dbo.prof ; go
Load "DimProfessor" Script (pg. 2 of 3) print 'decoding rank field' UPDATE dimprofessor SET dimprofessor.rank = case dimprofessor.rank when 'asst' then 'assistant prof' when 'assc' then 'associate prof' when 'prof' then 'full prof' end ; Go print 'decoding department field using imported excel spreadsheet' UPDATE dimprofessor SET dimprofessor.department = regnOLTP.dbo.departments.department FROM dimprofessor, regnOLTP.dbo.departments WHERE dimprofessor.department = regnOLTP.dbo.departments.prefix ; Go
Load "DimProfessor" Script (pg. 3 of 3) print 'adding SK -1 row' set identity_insertdimprofessor on Go insert into dimprofessor ( professor_sk, professor_id, firstname, lastname, rank, department) Values (-1, -1, 'unknown', 'unknown', 'unknown', 'unknown'); GO set identity_insertdimprofessor off Go Set nocount off
Load "FactEnrollment" Script print 'adding oltp registration data to fact_enrollment' INSERT INTO factenrollment ( student_sk, class_sk, date_sk, professor_sk, course_grade) SELECT student_sk, class_sk, datekey, professor_sk, final_grade FROM ((((regnOLTP.dbo.registration INNER JOIN dimstudent ON registration.stud_id = dimstudent.student_id) INNER JOIN dimclass ON regnOLTP.dbo.registration.callno = dimclass.crn) INNER JOIN dimtime ON CONVERT(varchar(10),regnOLTP.dbo.registration.regn_date,101) = actualdatekey) INNER JOIN regnOLTP.dbo.section ON dimclass.crn = regnOLTP.dbo.section.callno) LEFT JOIN dimprofessor ON regnOLTP.dbo.section.prof_id = dimprofessor.professor_id ; Go
Entire Transform/Load "Package" http://business.baylor.edu/gina_green/teaching/sqlserver/scripts/generate_class_performance_dw_tables.zip