1 / 48

Building the Data Warehouse: Transforming Data

Building the Data Warehouse: Transforming Data. Objectives. After completing this lesson, you should be able to do the following: Define transformation Identify possible staging models Identify data anomalies and eliminate them Explain the importance of quality data

jescie-soto
Download Presentation

Building the Data Warehouse: Transforming Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Building the Data Warehouse: Transforming Data

  2. Objectives • After completing this lesson, you should be able to do the following: • Define transformation • Identify possible staging models • Identify data anomalies and eliminate them • Explain the importance of quality data • Describe techniques for transforming data • Design transformation process • List Oracle’s enhanced features and tools that can be used to transform data

  3. Transformation • Transformation eliminates anomalies from operational data: • Cleans and standardizes • Presents subject-oriented data Transform: Clean up Consolidate Restructure Extract Warehouse Operationalsystems Load Data Staging Area

  4. Possible Staging Models • Remote staging model • Onsite staging model

  5. Operationalsystem Operationalsystem Warehouse Warehouse Remote Staging Model Data staging area within the warehouse environment Transform Extract Load Staging area Data staging area in its own environment Transform Extract Load Staging area

  6. On-site Staging Model • Data staging area within the operational environment,possibly affecting the operational system Transform Extract Load Operational system Staging area Warehouse

  7. Data Anomalies • No unique key • Data naming and coding anomalies • Data meaning anomalies between groups • Spelling and text inconsistencies

  8. Transformation Routines • Cleaning data • Eliminating inconsistencies • Adding elements • Merging data • Integrating data • Transforming data before load

  9. Transforming Data: Problems and Solutions • Multipart keys • Multiple local standards • Multiple files • Missing values • Duplicate values • Element names • Element meanings • Input formats • Referential Integrity constraints • Name and address

  10. Product code = 12M654313 45 Salesperson code Country code Sales territory Productnumber Multipart Keys Problem • Multipart keys

  11. Multiple Local Standards Problem • Multiple local standards • Tools or filters to preprocess cm DD/MM/YY 1,000 GBP inches MM/DD/YY FF 9,990 cm DD-Mon-YY USD 600

  12. Multiple Files Problem • Added complexity of multiple source files • Start simple Multiple source files Logic to detectcorrect source Transformed data

  13. Missing Values Problem • Solution: • Ignore • Wait • Mark rows • Extract when time-stamped If NULL thenfield = ‘A’

  14. Duplicate Values Problem • Solution: • SQL self-join techniques • RDMBS constraint utilities ACME Inc ACME Inc ACME Inc SQL> SELECT ... 2 FROM table_a, table_b 3 WHERE table_a.key (+)= table_b.key 4 UNION 5 SELECT ... 6 FROM table_a, table_b 7 WHERE table_a.key = table_b.key (+);

  15. Element Names Problem • Solution: • Common naming conventions Customer Client Customer Contact Name

  16. Element Meaning Problem • Avoid misinterpretation • Complex solution • Document meaning in metadata Customer’s name All customer details All details except name Customer_detail

  17. Input Format Problem EBCDIC ASCII “123-73” 12373 ACME Co. áøåëéí äáàéí Beer (Pack of 8)

  18. Referential Integrity Problem • Solution: • SQL anti-join • Server constraints • Dedicated tools

  19. Name and Address Problem • Single-field format • Multiple-field format Mr. J. Smith,100 Main St., Bigtown, County Luth, 23565

  20. Name and Address Processing in Oracle9i Warehouse Builder • Name and address mapping operator supports: • Parsing • Standardization • Postal matching and geocoding

  21. Quality Data: Importance and Benefits • Quality data: • Key to a successful warehouse implementation • Quality data helps you in: • Targeting right customers • Determining buying patterns • Identifying householders: private and commercial • Matching customers • Identify historical data

  22. Quality: Standards and Improvements • Setting standards: • Define a quality strategy • Decide on optimal data-quality level • Improving operational data quality: • Consider modifying rules for operational data • Document the sources • Create a data stewardship program • Design the cleanup process carefully • Initial cleanup and refresh routines may differ

  23. Data Quality Guidelines • Operational data: • Should not be used directly in the warehouse • Must be cleaned for each increment • Is not simply fixed by modifying applications

  24. Data Quality: Solutions and Management • Solutions: • COBOL, Java, 4GL • Specialized tools • Customized data conversion process • Investigation • Conditioning and Standardization • Integration • Management: • Take responsibility • Resolve problems • Data quality manager

  25. Transformation Techniques • Merging data • Adding a Date Stamp • Adding Keys to Data

  26. Merging Data • Operational transactions do not usually map one-to-one with warehouse data. • Data for the warehouse is merged to provide information for analysis.

  27. Merging Data

  28. Adding a Date Stamp • Time element can be represented as a: • Single point in time • Time span • Add time element to: • Fact tables • Dimension data

  29. Product Table Product_id Time_key Product_desc Store Table Store_id District_id Time_key Sales Fact Table Item_id Store_id Time_key Sales_dollars Sales_units Time Table Week_id Period_id Year_id Time_key Item Table Item_id Dept_id Time_key Adding a Date Stamp:Fact Tables and Dimensions

  30. #1 Sale 1/2/98 12:00:01 Ham Pizza $10.00 #2 Sale 1/2/98 12:00:02 Cheese Pizza $15.00 #3 Sale 1/2/98 12:00:02 Anchovy Pizza $12.00 #4 Return 1/2/98 12:00:03 Anchovy Pizza - $12.00 #5 Sale 1/2/98 12:00:04 Sausage Pizza $11.00 Data values or artificial keys #dw1 Sale 1/2/98 12:00:01 Ham Pizza $10.00 #dw2 Sale 1/2/98 12:00:02 Cheese Pizza $15.00 #dw3 Sale 1/2/98 12:00:04 Sausage Pizza $11.00 Adding Keys to Data

  31. Summarizing Data 1. During extraction on staging area 2. After loading to the warehouse server Operational databases Staging area Warehouse database

  32. Sources Stage Rules Publish Extract Transform Load Query Maintaining Transformation Metadata • Transformation metadata contains: • Transformation rules • Algorithms and routines

  33. Maintaining Transformation Metadata • Restructure keys • Identify and resolve coding differences • Validate data from multiple sources • Handle exception rules • Identify and resolve format differences • Fix referential integrity inconsistencies • Identify summary data

  34. Data Ownership and Responsibilities • Data ownership and responsibilities should be shared by the: • Operational team • Data warehouse team • Business benefit gained with “work together” approach

  35. Transformation Timing and Location • Transformation is performed: • Before load • In parallel • Can be initiated at different points: • On the operational platform • In a separate staging area

  36. Choosing a Transformation Point • Workload • Impact on environment • CPU usage • Disk space • Network bandwidth • Parallel execution • Load window time • User information needs

  37. Monitoring and Tracking • Transformations should: • Be self-documenting • Provide summary statistics • Handle process exceptions

  38. Designing Transformation Processes • Analysis: • Sources and target mappings, business rules • Key users, metadata, grain • Design options: • Third-party tools • Custom 3GL programs • 4GLs like SQL or PL/SQL • Replication • Design issues: • Performance • Size of the staging area • Exception handling, integrity maintenance

  39. Transformation Tools • Third-party tools • SQL*Loader • In-house developed programs

  40. Datawarehouse Oracle’s Enhanced Featuresfor Transformation • Transformation methods Loading intostaging tables Stagingtable 1 Flat Files Transformdata Validatedata Stagingtable 2 Merge intowarehouse tables Stagingtable 2 Multi stage Transformation

  41. External tables Externaltable Flat Files Merge intowarehouse tables Warehousetables Table functions Validatedata Transformdata Oracle’s Enhanced Featuresfor Transformation • Transformation methods Pipelined Transformation

  42. Oracle’s Enhanced Featuresfor Transformation • Transformation mechanisms • Using SQL: • CREATE TABLES AS SELECT (CTAS) • UPDATE • MERGE • Multitable INSERT Cust Customer Existingrowupdated Merge New rowinserted

  43. Condition Oracle’s Enhanced Featuresfor Transformation • Transformation mechanisms • Multitable INSERT Sourcetable Targettable 2 Targettable 3 Targettable 1

  44. Oracle’s Enhanced Featuresfor Transformation • Transformation mechanisms (continued) • Using PL/SQL: • Used for complex transformations • Using Table Functions: Table Functions can: • Return multiple rows from a function • Accept results of multiple row SQL subqueries as input • Take cursors as input • Be parallelized • Support incremental pipelining

  45. Summary • In this lesson, you should have learned how to: • Define transformation • Identify possible staging models • Identify data anomalies and eliminate them • Explain the importance of quality data • Describe techniques for transforming data • Design transformation process • List Oracle’s enhanced features and tools that can be used to transform data

  46. Practice 6-1 Overview • This practice covers the following topics: • Answering a series of questions based on the business scenario for Frontier Airways • Answering a series of short questions

More Related