IOUG Live! 2005
Design Tips for the Warehouse Architect
Presentation #506
David Stanford, President, Red Sky Data Inc.
david.stanford@redskydata.com
Objectives
• Obtain a clear understanding of data warehouse design ‘hot points’
• Identify solutions and alternatives for these ‘hot points’
• See how real-world solutions are implemented
Agenda
• Top 10 Gotchas
• Design Traps
• Loading Dirty Fact Data
• Surrogate Keys
• The Staging Area
• Slowly Changing Dimensions
• Tracking All History
• Audit Considerations
• Bad & Missing Data
• Administrative Fields
• Other Tidbits of Advice
Dave’s Top 10 Gotchas
• Failing to model for both a) the view of the data when the event occurred and b) the view of the data as of today’s reality
• Limiting the number of dimensions
• Failing to model and populate a meta data repository
• Failing to provide sufficient audit capabilities to verify loads against source systems
• Not using surrogate keys for everything
Dave’s Top 10 Gotchas (continued)
• Failing to design an error correction process
• Normalizing too much
• Not using a staging area
• Failing to load ALL of the fact data
• Failing to classify incorrect data
• Making it too complex!
Design Traps
• Design Review
• Staging Area
• Surrogate Keys
• Facts – Surrogates and Dirty Data
Data Warehouse Process
[Architecture diagram: Source OLTP Systems → Staging Area → Data Warehouse → Data Marts, supported by Meta Data and System Monitoring. Data characteristics progress from raw detail with no/minimal history, through integrated, scrubbed data with history and summaries, to targeted, specialized (OLAP) structures. Activities span design, mapping, extract, scrub, transform, load, index, aggregation, replication, data set distribution, access & analysis, and resource scheduling & distribution. Source: Enterprise Group]
Where The Work Is
[Same architecture diagram, with a callout noting that over 80% of the work happens between the source systems and the warehouse: design, mapping, extract, scrub, transform, and load]
Warehouse Design
• Normalized (Relational) Design
• Dimensional Design – Star and Snowflake
• Hybrid Design
• In reality, the DW is more normalized but has elements of dimensional design
• The data marts are star schemas but have elements of normalization
Modelling Is Not Straightforward
[Diagram: a Donation fact surrounded by candidate dimensions – Member, Time, Campaign, Location, and member attributes such as Marital Status, Age, Gender, and Income]
Should These Be Combined?
[Same diagram, asking whether member attributes such as Marital Status, Age, Gender, and Income should be combined into the Member dimension rather than modelled as separate dimensions]
Behind The Scenes
There are several aspects of a design that users don’t directly see:
• Meta Data
• Error Correction
• Audit
• Load Control (if not using a scheduling tool)
• Transformation Tables (used for transforming the data prior to being loaded into the DW)
Behind The Scenes
[Diagram: the Error Correction, Audit, Load Control, Transform Tables, and Meta Data structures sit behind the scenes, alongside the flow from Source OLTP Systems through the Staging Area and Data Warehouse to the Data Marts]
A 10 Step Design Process
1. Identify major subject areas or topics
2. Declare the grain
3. Add the element of time to the tables
4. Create appropriate names for tables, columns, and views
5. Add derived fields where applicable
6. Add administrative fields
7. Consider security and privacy in the design
8. Make sure the data model answers the critical business questions
9. Consider meta data
10. Consider error correction
And throughout, performance considerations: Tune, Tune, Tune
Independent of Approach…
…the goal of the data model is to satisfy two primary criteria:
1. Meet business objectives
2. Provide good performance
Staging Area
[Architecture diagram repeated, highlighting the staging area between the Source OLTP Systems and the Data Warehouse]
The Staging Area
• Holds a mirror copy of the extract files (a sketch appears below)
• Allows pre-processing of the data before loading
• Allows easier reloading (you WILL do this)
• Keeps more control with the DW team, rather than with an external group (the extract team)
The Staging Area (continued)
• Facilitates easier audit processes
• Can facilitate error correction processes
• Helps identify the record type (which translates into easier ETL processing and logic)
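As an illustration of the mirror-copy idea, the sketch below shows a staging table for a hypothetical donation extract (all table and column names are invented for the example). Keeping every column as raw text means a bad source value can never fail the load:

```sql
-- Hypothetical staging table: mirrors the extract file with no transformation.
-- Everything is loaded as text so dirty source values never break the load.
CREATE TABLE stg_donation_extract (
  member_no      VARCHAR2(20),    -- kept as raw text; validated later
  campaign_cd    VARCHAR2(10),
  donation_dt    VARCHAR2(20),    -- parsed into a DATE during transformation
  amount         VARCHAR2(20),
  extract_file   VARCHAR2(100),   -- which extract file the row came from (audit)
  load_batch_id  NUMBER           -- ties the row to a load-control record
);
```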
Surrogate Keys
• A surrogate key is a system-generated, non-intelligent (it carries no business meaning), single-column, unique identifier for each row within a table (see the sketch below)
• Always use surrogate keys for dimensions
• Always use surrogate keys for the time dimension
• Always use surrogate keys for transformation tables
• Always use surrogate keys for EVERY table
• …and this includes FACT tables
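A minimal sketch of a sequence-generated surrogate key on a dimension, assuming Oracle sequences (member_dim, member_seq, and the column names are invented for the example):

```sql
-- The surrogate key carries no business meaning; the natural key from the
-- source system is kept as an ordinary attribute for lookups and audit.
CREATE SEQUENCE member_seq;

CREATE TABLE member_dim (
  member_key   NUMBER PRIMARY KEY,   -- surrogate key
  member_no    VARCHAR2(20),         -- natural (business) key from the source
  member_name  VARCHAR2(100),
  gender       VARCHAR2(1)
);

INSERT INTO member_dim (member_key, member_no, member_name, gender)
VALUES (member_seq.NEXTVAL, 'M10042', 'J. Smith', 'F');
```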
Surrogate Keys Avoid…
• Duplicate keys from different source systems
• Recycling of primary keys
• Use of the same key for different business rows
• Lengthy composite key joins
• Wasted space in fact tables
• Exposure to application changes or upgrades in source systems
Using Surrogates In Fact Tables
• You will need a surrogate key on the fact table if you allow ‘unknown’ values into the fact table (which is recommended, by the way)
• The primary key of a fact table is typically the combination of the base dimension keys
Surrogates In Fact Tables
When multiple fact rows point at the same ‘unknown’ dimension members, the combination of dimension keys is no longer unique – this results in a duplicate primary key in the table. Thus the need for a surrogate primary key on the fact table.
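To make the duplicate-key problem concrete, here is a sketch using the invented schema from above, with key 0 reserved as the ‘unknown’ row in each dimension (the reserved-zero convention is an assumption, not something prescribed in the original):

```sql
CREATE TABLE donation_fact (
  donation_key  NUMBER PRIMARY KEY,  -- surrogate key keeps 'unknown' rows unique
  member_key    NUMBER NOT NULL,     -- 0 = unknown member
  campaign_key  NUMBER NOT NULL,     -- 0 = unknown campaign
  time_key      NUMBER NOT NULL,     -- 0 = unknown date
  amount        NUMBER(12,2)
);

-- Without donation_key, these two rows would collide on the composite
-- primary key (member_key, campaign_key, time_key) = (0, 0, 0).
INSERT INTO donation_fact VALUES (1, 0, 0, 0, 25.00);
INSERT INTO donation_fact VALUES (2, 0, 0, 0, 40.00);
```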
Load “Dirty” Data Into The Fact
• Ties out to source systems
• Gains credibility with end users
• Requires a few design resolutions:
  • Bad & Missing (BAM) logic (a lookup sketch follows this list)
  • Surrogate keys in the fact tables
• Still 100% accurate – we don’t load the bad values, we identify the bad values for correction later
• Empowers the end users to decide if the “dirty” data will invalidate their analysis
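One common way to implement this – an assumption here, not necessarily the author’s exact approach – is an outer-join lookup that substitutes the reserved ‘unknown’ key whenever a dimension match fails, so every source row still loads and the totals still tie out:

```sql
-- Every staged row loads; failed lookups fall back to the 'unknown' key 0.
-- donation_seq, campaign_dim, and time_dim are assumed names; a bad amount
-- would be screened by the BAM rules before this step.
INSERT INTO donation_fact (donation_key, member_key, campaign_key, time_key, amount)
SELECT donation_seq.NEXTVAL,
       NVL(m.member_key, 0),       -- unknown member
       NVL(c.campaign_key, 0),     -- unknown campaign
       NVL(t.time_key, 0),         -- unknown date
       TO_NUMBER(s.amount)
FROM   stg_donation_extract s
       LEFT JOIN member_dim   m ON m.member_no   = s.member_no
       LEFT JOIN campaign_dim c ON c.campaign_cd = s.campaign_cd
       LEFT JOIN time_dim     t ON t.cal_date    = TO_DATE(s.donation_dt, 'YYYY-MM-DD');
```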
Tracking History in Dimensions
• Type 1 – No history
• Type 2 – All history
• Type 3 – Some history
More Dimension Types…Combinations
• Type 3 Prime – Types 1 and 2 (the most common)
• Type 4 – Types 1 and 3
• Type 5 – Types 2 & 3
• Type 6 – Types 1, 2, and 3 (the second most common)
A sketch of the two base behaviours follows.
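As a concrete illustration of the two base behaviours (using the invented member dimension, now assumed to carry effective-date columns): a Type 1 change overwrites in place, while a Type 2 change expires the current row and inserts a new version:

```sql
-- Type 1: overwrite in place – no history kept for a corrected name.
UPDATE member_dim
SET    member_name = 'Jane Smith'
WHERE  member_no = 'M10042';

-- Type 2: preserve history – expire the current row and insert a new version
-- when a tracked attribute (here marital_status) changes.
UPDATE member_dim
SET    effective_end_dt = DATE '2005-03-31'
WHERE  member_no = 'M10042' AND effective_end_dt IS NULL;

INSERT INTO member_dim (member_key, member_no, member_name, marital_status,
                        effective_start_dt, effective_end_dt)
VALUES (member_seq.NEXTVAL, 'M10042', 'Jane Smith', 'Married',
        DATE '2005-04-01', NULL);
```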
Trigger Fields
• Trigger fields are the fields within a table for which you want to track history
• Non-trigger fields are those for which you do not want to track history
Type 3 Prime – All and No History
[Example table: trigger fields are tracked with Type 2 versioning, while non-trigger fields are simply overwritten in place; a classification sketch follows]
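A sketch of how a load might classify incoming rows against the current dimension version – trigger changes get a new Type 2 row, non-trigger changes are overwritten in place (stg_member_extract is invented, and marital_status is assumed to be the trigger field):

```sql
-- Compare each staged member to its current dimension row and decide how
-- the change should be applied downstream.
SELECT s.member_no,
       CASE
         WHEN d.member_no IS NULL                  THEN 'NEW MEMBER'
         WHEN s.marital_status <> d.marital_status THEN 'TRIGGER CHANGE'     -- new Type 2 row
         WHEN s.member_name    <> d.member_name    THEN 'NON-TRIGGER CHANGE' -- Type 1 overwrite
         ELSE 'NO CHANGE'
       END AS change_type
FROM   stg_member_extract s
       LEFT JOIN member_dim d
              ON d.member_no = s.member_no
             AND d.effective_end_dt IS NULL;   -- current version only
```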
Expect To Track Everything
• Users want to view the data as it was when the transaction or event occurred, AND…
• Users want to view the data in the context of today’s realities
THUS, model for both!
Add ‘Current’ Columns
• In order to provide these two views, consider adding ‘current’ columns to tables – a special Type 6
• These fields get updated in the historical records when a trigger field changes value in the current record
• This simplifies the use of the DW for the users – it’s easier to understand than having to write complex SQL (sketched below)
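A sketch of the idea with the invented member dimension (current_marital_status is an assumed column): each historical row keeps the value as of that version, while the ‘current’ column is refreshed across all versions whenever the trigger field changes:

```sql
-- When the trigger field changes, push the new value back into every
-- historical row's 'current' column (the Type 2 versioning still happens
-- as in the earlier example).
UPDATE member_dim
SET    current_marital_status = 'Married'
WHERE  member_no = 'M10042';

-- Users can then ask either question without complex SQL:
--   as it was then:   SELECT marital_status         FROM member_dim ...
--   as it is today:   SELECT current_marital_status FROM member_dim ...
```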
Most Recent Flag
• Tracks the most recent record in time (not the most recently loaded, but based on the time series)
• Should be added to the dimensions as a Yes/No (1/0) field
• The most recent record is set to Yes; all other records are set to No
• Allows users to restrict on the Most Recent Flag to get a view of the world today (sketched below)
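A minimal sketch of maintaining the flag, assuming an open-ended effective_end_dt marks the most recent version (most_recent_flag is an assumed column name):

```sql
-- Re-derive the flag from the effective dates after each load.
UPDATE member_dim
SET    most_recent_flag = CASE WHEN effective_end_dt IS NULL THEN 1 ELSE 0 END;

-- A "view of the world today" then needs no date logic at all:
SELECT * FROM member_dim WHERE most_recent_flag = 1;
```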
Double Keying Type 2 Dimensions
• Use a double surrogate key in Type 2 dimensions:
  • One key is unique for each individual row
  • One key is unique for each individual business key
• Protects against authoritative source system changes / duplication (a DDL sketch follows)
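A DDL sketch of a double-keyed Type 2 dimension (names invented): member_key is unique per row version, while member_id stays constant across every version of the same business entity:

```sql
-- If the source system ever changes, recycles, or duplicates member_no,
-- member_id still ties all versions of the same member together.
CREATE TABLE member_dim (
  member_key          NUMBER PRIMARY KEY,  -- unique for each individual row (version)
  member_id           NUMBER NOT NULL,     -- unique for each business key
  member_no           VARCHAR2(20),        -- natural key from the source system
  marital_status      VARCHAR2(10),
  effective_start_dt  DATE,
  effective_end_dt    DATE
);
```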
Rapidly Changing Dimensions
• Rapidly Changing Dimensions (RCDs) need to be partitioned
• Use Oracle partitioning (a sketch follows this list)
• Include the native partition key in the dimension
• Or split into several tables
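A minimal sketch of the Oracle partitioning option, range-partitioning a rapidly changing dimension by its effective start date (the scheme and names are assumptions, not a prescription):

```sql
-- Range partitioning keeps each year's versions physically together:
-- historical partitions stay static while current ones absorb the churn.
CREATE TABLE customer_dim (
  customer_key        NUMBER PRIMARY KEY,
  customer_no         VARCHAR2(20),
  credit_score_band   VARCHAR2(10),   -- the rapidly changing attribute
  effective_start_dt  DATE NOT NULL   -- the native partition key, kept in the dimension
)
PARTITION BY RANGE (effective_start_dt) (
  PARTITION p2003 VALUES LESS THAN (DATE '2004-01-01'),
  PARTITION p2004 VALUES LESS THAN (DATE '2005-01-01'),
  PARTITION p2005 VALUES LESS THAN (DATE '2006-01-01')
);
```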
Bad & Missing Fact Data
• Bad and/or missing data will always be an issue
• The source data is never completely clean
• There are always exceptions
• Recall that you need to tie back to the source systems for your audit, thus you must load this ‘incorrect’ data
• Put the decisions into the hands of your users – don’t decide for them whether the data is good enough or not
• Need to develop Bad & Missing (BAM) rules (a sketch follows)
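One way to make BAM rules concrete – an illustrative assumption, not the author’s documented design – is to classify and log each failed value during the load, while the fact row itself still loads with ‘unknown’ keys as sketched earlier:

```sql
-- Each rejected or suspect value is classified and logged so it can be
-- corrected and reprocessed later (bam_seq is an assumed sequence).
CREATE TABLE bam_log (
  bam_id         NUMBER PRIMARY KEY,
  load_batch_id  NUMBER,
  source_table   VARCHAR2(30),
  source_column  VARCHAR2(30),
  bad_value      VARCHAR2(100),
  bam_class      VARCHAR2(20),   -- e.g. 'MISSING', 'BAD', 'OUT_OF_RANGE'
  logged_dt      DATE DEFAULT SYSDATE
);

INSERT INTO bam_log (bam_id, load_batch_id, source_table, source_column,
                     bad_value, bam_class)
VALUES (bam_seq.NEXTVAL, 42, 'STG_DONATION_EXTRACT', 'MEMBER_NO',
        'XX-UNKNOWN', 'BAD');
```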