1 / 71

Presentation # 506

IOUG Live! 2005. Design Tips for the Warehouse Architect. Presentation # 506. David Stanford President Red Sky Data Inc. david.stanford@redskydata.com. Objectives. Obtain a clear understanding of data warehouse design ‘hot points’ Identify solutions and alternatives for these ‘hot points’

dalton
Download Presentation

Presentation # 506

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. IOUG Live! 2005 Design Tips for the Warehouse Architect Presentation # 506 David Stanford President Red Sky Data Inc. david.stanford@redskydata.com

  2. Objectives • Obtain a clear understanding of data warehouse design ‘hot points’ • Identify solutions and alternatives for these ‘hot points’ • See how real world solutions are implemented

  3. Agenda • Top 10 Gotchya’s • Design Traps • Loading Dirty Fact Data • Surrogate Keys • The Staging Area • Slowly Changing Dimensions • Tracking All History • Audit Considerations • Bad & Missing Data • Administrative Fields • Other Tidbits of Advice

  4. Dave’s Top 10 Gotchya’s • Failing to model for both a) view of the data when the event occurred and b) view of the data as of today’s reality • Limiting the number of dimensions • Failing to model and populate a meta data repository • Failing to provide sufficient audit capabilities to verify loads against source systems • Not using surrogate keys for everything

  5. Dave’s Top 10 Gotchya’s • Failing to design an error correction process • Normalizing too much • Not using a staging area • Failing to load ALL of the fact data • Failing to classify incorrect data • Making it too complex!

  6. Design Traps Design Review Staging Area Surrogate Keys Facts – Surrogates and Dirty Data

  7. Data Warehouse Process Data Characteristics • Raw Detail • No/Minimal History • Integrated • Scrubbed • History • Summaries • Targeted • Specialized (OLAP) Source OLTP Systems Data Marts Staging Area Data Warehouse • Design • Mapping • Extract • Scrub • Transform • Load • Index • Aggregation • Replication • Data Set Distribution • Access & Analysis • Resource Scheduling & Distribution Meta Data System Monitoring Source: Enterprise Group

  8. Where The Work Is Source OLTP Systems Data Marts Over 80% of the work is here Staging Area Data Warehouse • Design • Mapping • Extract • Scrub • Transform • Load • Index • Aggregation • Replication • Data Set Distribution • Access & Analysis • Resource Scheduling & Distribution Meta Data System Monitoring Source: Enterprise Group

  9. Warehouse Design • Normalized (Relational) Design • Dimensional Design – Star and Snowflake • Hybrid Design • In reality, the DW is more normalized but has elements of dimensional design • The data marts are star schemas but have elements of normalization

  10. Modelling is not straight forward Marital Status Age Member Donation Time Gender Campaign Income Location

  11. Should These Be Combined? MaritalStatus Age Member Donation Time Gender Campaign Income Location

  12. Behind The Scenes • There are several aspects of a design that users don’t directly see: • Meta Data • Error Correction • Audit • Load Control (if not using a scheduling tool) • Transformation Tables (used for transforming the data prior to being loaded into the DW)

  13. Error Correction Audit Load Control Transform Tables Meta Data Behind The Scenes Data Marts Source OLTP Systems Staging Area Data Warehouse

  14. A 10 Step Design Process • Identify major subject areas or topics • Declare the Grain • Add element of time to the tables • Create appropriate names for tables, columns, and views • Add derived fields where applicable • Add administrative fields • Consider security and privacy in design • Make sure data model answers the critical business questions • Consider meta data • Consider error correction • Performance considerations: Tune, Tune, Tune

  15. Independent of Approach… …the goal of the data model is to satisfy two primary criteria: 1. Meet Business Objectives 2. Provide Good Performance

  16. Staging Area

  17. Staging Area Data Characteristics • Raw Detail • No/Minimal History • Integrated • Scrubbed • History • Summaries • Targeted • Specialized (OLAP) Source OLTP Systems Data Marts Staging Area Data Warehouse • Design • Mapping • Extract • Scrub • Transform • Load • Index • Aggregation • Replication • Data Set Distribution • Access & Analysis • Resource Scheduling & Distribution Meta Data System Monitoring Source: Enterprise Group

  18. The Staging Area • Holds a mirror copy of the extract files • Allows pre-processing of the data before loading • Allows easier reloading (you WILL do this) • Keeps more control with the DW team, rather than with an external group (the extract team)

  19. The Staging Area • Facilitates easier audit processes • Can facilitate error correction processes • Helps identifying the Record Type (translates into easier ETL processing and logic)

  20. Surrogate Keys

  21. Surrogate Keys • A surrogate key is a system generated, unintelligent, single column, unique identifier for each row within a table • Always use surrogate keys for dimensions • Always use surrogate keys for the time dimension • Always use surrogate keys for transformation tables • Always use surrogate keys for EVERY table • ..and this includes FACT tables

  22. Surrogate Keys Avoid… • Duplicate keys from different source systems • Recycling of primary keys • Use of the same key for different business rows • Lengthy composite key joins • Space in fact tables • Application changes or upgrades in source systems

  23. Using Surrogates In Fact Tables • You will need a surrogate key on the fact table if you allow ‘unknown’ values into the fact table (which is recommended by the way) • The Primary Key of a fact table is typically the combination of the base dimensions

  24. Surrogates In Fact Tables

  25. Surrogates In Fact Tables

  26. Surrogates In Fact Tables

  27. Surrogates In Fact Tables This results in a duplicate primary key in the table

  28. Surrogates In Fact Tables Thus the need for a surrogate primary key

  29. Load “Dirty” Data Into The Fact • Ties out to source systems • Gains credibility with end users • Requires a few design resolutions: • Bad & Missing (BAM) Logic • Surrogate Keys in the Fact tables • Still 100% accurate – we don’t load the bad values, we identify the bad values for correction later • Empowers the End Users to decide if the “dirty” data will invalidate their analysis

  30. Tracking History

  31. Tracking History in Dimensions • Type 1 – No history • Type 2 – All history • Type 3 – Some history

  32. Type 1 – No History

  33. Type 1 – No History

  34. Type 2 – All History

  35. Type 3 – Some History

  36. Type 3 – Some History

  37. More Dimension Types…Combinations • Type 3 Prime – Types 1 and 2 (the most common) • Type 4 – Types 1 and 3 • Type 5 – Types 2 & 3 • Type 6 – Types 1, 2, and 3 (the second most common)

  38. Trigger Fields • Trigger Fields are fields within a table that you want to track history • Non-Trigger fields are those which you do not want to track history

  39. Type 3 Prime –All and No History Non Trigger Fields Trigger Field

  40. Non Trigger Field Update

  41. Trigger & Non Trigger Field Update

  42. Changes One At A Time

  43. Expect To Track Everything • Users want to view the data as it was when the transaction or event occurred AND… • Users want to view the data in the context of today’s realities THUS, model for both!

  44. In order to provide these two views, consider adding ‘current’ columns to tables. This is a special Type 6. These fields get updated in historical records when a trigger field changes value in the current record. This simplifies the use of the DW by the users It’s easier to understand than having to write complex SQL Add ‘Current’ Columns

  45. Type 6 – All, Some, and No History

  46. Most Recent Flag • Tracks the Most Recent record in time (not loaded, but based on a time series) • Should be added to the dimensions as a Yes/No (1/0) field • The most recently loaded record is set to Yes, all other records are set to No • Allows user to restrict on the Most Recent Flag to get a view of the world today

  47. Double Keying Type 2 Dimensions • Double surrogate key in Type 2 dimensions • 1 key is unique for each individual row • 1 key is unique for each individual business key • Protects against: • Authoritative source system changes / duplication

  48. Rapidly Changing Dimensions • Rapidly Changing Dimensions (RCD’s) need to be partitioned • Use Oracle partitioning • Include the native partition key in the dimension • Or split into several tables

  49. BAM Rules, Audit & Administrative Fields

  50. Bad & Missing Fact Data • Bad and/or missing data will be always be an issue • The source data is never completely clean • There are always exceptions • Recall that you need to tie back into the source systems for your audit, thus you must load this ‘incorrect’ data • Put the decisions into the hands of your users – don’t decide for them whether the data is good enough or not • Need to develop Bad & Missing (BAM) Rules

More Related