1 / 40

MIS 4346/5346 Data warehousing

MIS 4346/5346 Data warehousing. Extract, Transform, Load. Agenda. Review Analysis (Bus Matrix, Info Package) Logical Design (Dimensional Modeling) Physical Design (Spreadsheet) Implementation (Data Mart Relational Tables) ETL Process Overview ETL Components Staging Area Extraction

haig
Download Presentation

MIS 4346/5346 Data warehousing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. MIS 4346/5346 Data warehousing Extract, Transform, Load

  2. Agenda • Review • Analysis (Bus Matrix, Info Package) • Logical Design (Dimensional Modeling) • Physical Design (Spreadsheet) • Implementation (Data Mart Relational Tables) • ETL Process Overview • ETL Components • Staging Area • Extraction • Transformation • Loading • Documenting High-Level ETL Requirements • Documenting Detailed ETL Flows • Example ETL

  3. Review: Dimensional Modeling

  4. Review: DM Implementation DimStudent FactEnrollment CREATE TABLE DimStudent( student_skintidentity(1,1), student_idvarchar(9), firstnamevarchar(30), lastnamevarchar(30), city varchar(20), state varchar(2), major varchar(6), classification varchar(25), gpa numeric(3, 2), club_namevarchar(25), undergrad_schoolvarchar(25), gmatint, undergrad_or_gradvarchar(10), CONSTRAINT dimstudent_pk PRIMARY KEY (student_sk)); GO CREATE TABLE FactEnrollment( student_skint, class_skint, date_skint, professor_skint, course_grade numeric(2, 1), CONSTRAINT factenrollment_pk PRIMARY KEY (student_sk, class_sk, date_sk, professor_sk), CONSTRAINT factenrollment_student_fk FOREIGN KEY (student_sk) REFERENCES dimstudent(student_sk), CONSTRAINT factenrollment_class_fk FOREIGN KEY(class_sk) REFERENCES dimclass (class_sk), CONSTRAINT factenrollment_date_fk FOREIGN KEY(date_sk) REFERENCES dimtime (date_sk), CONSTRAINT factenrollment_professor_fk FOREIGN KEY(professor_sk) REFERENCES dimprofessor (professor_sk)); GO

  5. Review: Physical DW Design

  6. ETL Overview • Reshaping relevant data from source systems into useful information stored in the DW • Extract • Copying and integrating data from OLTP and other data sources in preparation for cleansing and loading into the DW • Transform • Cleaning and converting data to prepare it for loading into the DW • Load • Putting cleansed and converted data into the DW

  7. ETL Process • Not Really New, BUT… • Much more data • Includes rearranging, summarizing • Data used for strategic decision-making • Characteristics: • Process AND technology • Detailed, highly-dependent tasks • Consumes average 75% of DW development • An on-going process for life of DW • Requirements: • Well-documented • Automated • Flexible

  8. ETL Process • Determine target data • Determine data sources • Prepare data mapping • Organize data staging area • Establish data extraction rules • Establish data transformation rules • Plan aggregate tables • Establish data load procedures • Load dimension tables • Load fact tables

  9. ETL Process Flow 3, Spreadsheet 1, Dim Model 2, Spreadsheet 5, SSIS 8, 9, 10, SSIS 6, 7, Map & SSIS 4

  10. ETL Staging Area • Information hub,facilitating the enriching stages that data goes through to populate a DW • Advantages: • Separates source systems and DW • Minimizes ETL impact on source AND DW systems • Can consist of multiple “hubs” • “upload” area • “staging” area • “DW load images”

  11. ETL Staging Area, cont…

  12. High Level Design of ETL Process • Initial documentation of: • What data do we need and where is it coming from? • Physical DW Design Spreadsheet shown previously • What are the major transformation/cleansing needs? • “Extend” Physical DW Design Spreadsheet OR • ETL Map • What’s the sequence of activities for ETL? • ETL Map

  13. Common Transformations • Format Revisions • Key Restructuring, Lookup • Handling of Null Values • Decoding fields • Calculated, Derived values • Merging of Data

  14. Common Transformations, cont… • Splitting of single fields • Character set conversion • Units of measurement conversion • Date/time conversion • Summarization • Deduplication

  15. Common Transformations, cont… • Other Data Quality Issues • Standardize values • Validate values • Identifying mismatches, misspellings • Etc… • Suggestions: • Appoint “Data Stewards” • Ensure ETL programs have control checks • Data Profiling…

  16. Comparison of Models

  17. Transformations Example

  18. Data Profiling • Systematic analysis of the content of a data source • Goals: • Anticipate potential data quality issues upfront • Build quality corrections and controls into ETL process • Manual and/or Tool-assisted

  19. Profiling Example: Manual

  20. Profiling Example: SSIS

  21. Documenting ETL High Level Design • Add to existing DW Physical Design Spreadsheet

  22. Documenting ETL High Level Design

  23. Low Level Design of ETL Process • Detailed documentation of: • What data do we need and where is it coming from? • What are the major transformation/cleansing needs? • What’s the sequence of activities for ETL? • Can use tool like SSIS

  24. Extracting Source Data • Two forms: • Static Data Capture • Point-in-time snapshot • Initial Loads and periodic refreshes • Revised Data Capture • Only data that has been added, updated, deleted since last load • Ongoing incremental loads • Two timeframes • Immediate • Deferred

  25. Static Data Capture • (T)SQL Scripts • e.g., small number of tables/rows • Export/Import Tables • e.g., database or non-database sources • Backup/Restore Database • e.g., copying sqlserver source database for initial load ETL • Detach/Attach Database • e.g., copying older sqlserver version to newer sqlserver version for initial load ETL

  26. Revised Data Capture • Immediate / Real-time • ETL side: procs get changed data from log real-time and update ETL staging tables • OLTP side: triggers update ETL staging tables • OLTP side: apps write to OLTP AND ETL staging tables • Deferred • ETL side: procs get changed data from OLTP tables based on timestamps • ETL side: procs do file comparison • OLTP side: changed data capture (SS 2008)

  27. Documenting ETL Low Level Design: SSIS • Comes with SQL Server • Helps document and automate ETL process • Based on defining • Packages • Tasks • One approach • A package for each target table • A "master" package

  28. SSIS Package Examples: Master

  29. SSIS Package Examples: Extract All

  30. SSIS Package Examples: Extract Changed using CDC Eg, SELECT * from cdc-customer WHERE cdc_chg_date > etl_last_capture_date;

  31. SSIS Package Examples: Transforms

  32. SSIS Package Examples: Load

  33. Class Performance DW Example • Create ClassPerformanceDW database • Using ClassPerformanceDW database… • Create ClassPerformanceDW tables using SQL Script • http://business.baylor.edu/gina_green/teaching/sqlserver/scripts/generate_class_performance_dw_tables/create_class_performance_dw_tables.sql

  34. ETL Example using SQL Scripts • One "Master Script" • Calls five "table" scripts

  35. "Master" Script --be sure to turn on Query, SQLCMD mode in order to run this script Use ClassPerformanceDW print 'loading dimclass table' Go :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimclass.sql" print 'loading dimprofessor table' Go :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimprofessor.sql" print 'loading dimstudent table' Go :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimstudent.sql" print 'loading dimtime table' Go :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimtime.sql" print 'loading factenrollment table' Go :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_factenrollment.sql" Print 'class performance DW data transformation and loading is complete' Go

  36. Load "DimProfessor" Script (pg. 1 of 3) set nocount on print 'remove existing data from dimprofessor' delete from dimprofessor; go print 'reseeding SK identity value back to 1' dbcccheckident ('dimprofessor', reseed, 0); go print 'adding oltp prof data to dimprofessor' print 'professor_sk will be automatically inserted' insert into dimprofessor ( professor_id, firstname, lastname, rank, department) select prof_id, firstname, lastname, rank, dept from regnOLTP.dbo.prof ; go

  37. Load "DimProfessor" Script (pg. 2 of 3) print 'decoding rank field' UPDATE dimprofessor SET dimprofessor.rank = case dimprofessor.rank when 'asst' then 'assistant prof' when 'assc' then 'associate prof' when 'prof' then 'full prof' end ; Go print 'decoding department field using imported excel spreadsheet' UPDATE dimprofessor SET dimprofessor.department = regnOLTP.dbo.departments.department FROM dimprofessor, regnOLTP.dbo.departments WHERE dimprofessor.department = regnOLTP.dbo.departments.prefix ; Go

  38. Load "DimProfessor" Script (pg. 3 of 3) print 'adding SK -1 row' set identity_insertdimprofessor on Go insert into dimprofessor ( professor_sk, professor_id, firstname, lastname, rank, department) Values (-1, -1, 'unknown', 'unknown', 'unknown', 'unknown'); GO set identity_insertdimprofessor off Go Set nocount off

  39. Load "FactEnrollment" Script print 'adding oltp registration data to fact_enrollment' INSERT INTO factenrollment ( student_sk, class_sk, date_sk, professor_sk, course_grade) SELECT student_sk, class_sk, datekey, professor_sk, final_grade FROM ((((regnOLTP.dbo.registration INNER JOIN dimstudent ON registration.stud_id = dimstudent.student_id) INNER JOIN dimclass ON regnOLTP.dbo.registration.callno = dimclass.crn) INNER JOIN dimtime ON CONVERT(varchar(10),regnOLTP.dbo.registration.regn_date,101) = actualdatekey) INNER JOIN regnOLTP.dbo.section ON dimclass.crn = regnOLTP.dbo.section.callno) LEFT JOIN dimprofessor ON regnOLTP.dbo.section.prof_id = dimprofessor.professor_id ; Go

  40. Entire Transform/Load "Package" http://business.baylor.edu/gina_green/teaching/sqlserver/scripts/generate_class_performance_dw_tables.zip

More Related