940 likes | 1.07k Views
International Oracle Users Group. April 15, 2002. The Warehouse Designer’s School Of Hard Knocks. The Data Warehouse Designer’s School of Hard Knocks. A Graduate’s Perspective. David Stanford Sr. Vice President Cognicase Inc. david.stanford@cognicase.com. Objectives.
E N D
International Oracle Users Group April 15, 2002 The Warehouse Designer’s School Of Hard Knocks The Data Warehouse Designer’s School of Hard Knocks A Graduate’s Perspective David Stanford Sr. Vice President Cognicase Inc. david.stanford@cognicase.com
Objectives • Obtain a clear understanding of data warehouse design ‘hot points’ • Identify solutions and alternatives for these ‘hot points’ • See how real world solutions are implemented
Agenda • Top 10 Gotchya’s • Warehouse Design • Surrogate Keys • Tracking History • Row Level Security In The Warehouse • BAM Rules, Audit and Administrative Fields • Other Tidbits of Advice
Dave’s Top 10 Gotchya’s • Failing to model for both a) view of the data when the event occurred and b) view of the data as of today’s reality • Limiting the number of dimensions • Failing to model and populate a meta data repository • Failing to provide sufficient audit capabilities to verify loads against source systems • Not using surrogate keys for everything
Dave’s Top 10 Gotchya’s (cont’d) • Failing to design an error correction process • Normalizing too much • Not using a staging area • Failing to load ALL of the fact data • Failing to classify incorrect data
Overall Architecture • At the 20,000 foot level we must decide on the use of: • The Operational Data Store (ODS) • The Staging Area • The Data Warehouse proper • The Data Mart(s) • All of which are being coined as “The Corporate Information Factory” or CIF
Design Considerations • The ODS vs the Warehouse • The ODS vs the Staging Area • The Warehouse vs the Data Mart • The lines become blurred • Plan for the world, design for the future, and build for today
Users HR Source OLTP Systems Sales Management Manufacturing Not Designing For The Future… Stove Pipe data marts...
Constituency …leads to trouble …become legamarts
Basic Data Warehouse Architecture Source OLTP Systems Staging Area Data Warehouse
Data Mart DW Architecture Data Marts Source OLTP Systems Staging Area Data Warehouse Source: Enterprise Group
Data Warehouse Process Data Characteristics • Raw Detail • No/Minimal History • Integrated • Scrubbed • History • Summaries • Targeted • Specialized (OLAP) Source OLTP Systems Data Marts Staging Area Data Warehouse • Design • Mapping • Extract • Scrub • Transform • Load • Index • Aggregation • Replication • Data Set Distribution • Access & Analysis • Resource Scheduling & Distribution Meta Data System Monitoring Source: Enterprise Group
Where The Work Is Source OLTP Systems Data Marts Over 80% of the work is here Staging Area Data Warehouse • Design • Mapping • Extract • Scrub • Transform • Load • Index • Aggregation • Replication • Data Set Distribution • Access & Analysis • Resource Scheduling & Distribution Meta Data System Monitoring Source: Enterprise Group
Legacy Legacy MDB Data Warehouse Architecture ETL Tool Warehouse Central Staging Admin. Data Area Warehouse Mid Mid - - Relational Relational Tier Tier Data Local Transform Metadata Extract and Load Process ERP Data Mid - Central Mart Tier e - Commerce Meta Data RDBMS Local Metadata Metadata Exchange External Data Data Data Data Data Cleansing Modeling Modeling Mart Mart Tool Tool Tool Data Local Local Cleansing RDBMS RDBMS MDB MDB Metadata Metadata Tool Source Data Data Data Transforma - Central Data Architected Data Access Databases Extraction Staging tion & Load Warehouse Data Marts and Analysis Components of a Data Warehousing Architecture Components of a Data Warehousing Architecture
The Staging Area • Holds a mirror copy of the extract files • Allows pre-processing of the data before loading • Allows easier reloading (you WILL do this) • Keeps more control with the DW team, rather than an external group (the extract team) • Facilitates easier audit processes • Can facilitate error correction processes
Modelling is not straight forward Marital Status Age Member Donation Time Gender Campaign Income Location
Should These Dimensions Be Combined? MaritalStatus Age Member Donation Time Gender Campaign Income Location
The 10 Step Process-Data Model Design • Identify major subject areas or topics • Add element of time to the tables • Create appropriate names for tables, columns, and views • Add derived fields where applicable • Add administrative fields • Consider security and privacy in design • Make sure data model answers the critical business questions • Consider meta data • Consider error correction • Performance considerations: Tune, Tune, Tune
Independent of Approach… …the goal of the data model is to satisfy two primary criteria: 1. Meet Business Objectives 2. Provide Good Performance
Warehouse Design • Normalized (Relational) Design • Dimensional Design • Hybrid Design • Behind the Scenes
Normalized/Relational Schema • Usually As Normalized as Possible • Used mostly in OLTP databases • Uses entities and relations to describe data • Fast for Inserts and Updates
The Star Schema STAR Schema • Used in OLAP (BI) and DWH • Uses FACT and DIMENSION Tables • Normalized FACT table • Dimensions Denormalized
Sample Star Schema Dimension Fact Table
Snowflake Schema • Contains FACT and DIMENSION Tables • Dimension Tables can be FACT for other STAR • Dimension Hierarchies are normalized
Sample Snowflake Schema SnowFlake Dimensions
Hybrid • In reality, the DW is more normalized but has elements of dimensional design • The data marts are star schemas but have elements of normalization
Behind The Scenes • There are several aspects of a design that users don’t directly see: • Meta Data • Error Correction • Audit • Load Control (if not using a scheduling tool) • Transformation Tables (used for transforming the data prior to being loaded into the DW)
Error Correction Audit Load Control Transform Tables Meta Data Behind The Scenes Data Marts Source OLTP Systems Staging Area Data Warehouse
Surrogate Keys • A surrogate key is a single column, unique identifier for each row within a table • Always use surrogate keys for dimensions • Always use surrogate keys for the time dimension • Always use surrogate keys for facts • Always use surrogate keys for transformation tables • Always use surrogate keys for EVERY table
Surrogate Keys Avoid… • Duplicate keys from different source systems • Recycling of primary keys • Use of the same key for different business rows • Lengthy composite key joins • Space in fact tables • Application changes or upgrades in source systems
Using Surrogates In Fact Tables • You will need a surrogate key on the fact table if you allow ‘unknown’ values into the fact table (which is recommended by the way) • The PK of a fact is typically the combination of the base dimensions
Surrogates In Fact Tables • This results in a duplicate primary key in the table
Surrogates In Fact Tables • Thus the need for a surrogate primary key
Tracking History in Dimensions • Type 1 – No history • Type 2 – All history • Type 3 – Some history
More Dimension Types…Combinations • Type 3 Prime – Types 1 and 2 (the most common) • Type 4 – Types 1 and 3 • Type 5 – Types 2 & 3 • Type 6 – Types 1, 2, and 3 (the second most common)
Trigger Fields • Trigger Fields are fields within a table that you want to track history • Non-Trigger fields are those which you do not want to track history
Type 3 Prime –All and No History Non Trigger Fields Trigger Field