IOUG Live! 2005
Design Tips for the Warehouse Architect
Presentation #506
David Stanford, President, Red Sky Data Inc.
david.stanford@redskydata.com
Objectives
• Obtain a clear understanding of data warehouse design ‘hot points’
• Identify solutions and alternatives for these ‘hot points’
• See how real-world solutions are implemented
Agenda
• Top 10 Gotchas
• Design Traps
• Loading Dirty Fact Data
• Surrogate Keys
• The Staging Area
• Slowly Changing Dimensions
• Tracking All History
• Audit Considerations
• Bad & Missing Data
• Administrative Fields
• Other Tidbits of Advice
Dave’s Top 10 Gotchas
• Failing to model for both a) the view of the data when the event occurred and b) the view of the data as of today’s reality
• Limiting the number of dimensions
• Failing to model and populate a meta data repository
• Failing to provide sufficient audit capabilities to verify loads against source systems
• Not using surrogate keys for everything
Dave’s Top 10 Gotchas (continued)
• Failing to design an error correction process
• Normalizing too much
• Not using a staging area
• Failing to load ALL of the fact data
• Failing to classify incorrect data
• Making it too complex!
Design Traps
• Design Review
• Staging Area
• Surrogate Keys
• Facts – Surrogates and Dirty Data
Data Warehouse Process
[Architecture diagram: Source OLTP Systems → Staging Area → Data Warehouse → Data Marts, supported by Meta Data and System Monitoring. Data characteristics progress from raw detail with no/minimal history, through integrated, scrubbed data with history and summaries, to targeted, specialized (OLAP) structures. Activities span design, mapping, extract, scrub, transform, load, index, aggregation, replication, data set distribution, access & analysis, and resource scheduling & distribution. Source: Enterprise Group]
Where The Work Is
[Same architecture diagram, with a callout noting that over 80% of the work happens between the source systems and the warehouse: design, mapping, extract, scrub, transform, and load]
Warehouse Design
• Normalized (Relational) Design
• Dimensional Design – Star and Snowflake
• Hybrid Design
• In reality, the DW is more normalized but has elements of dimensional design
• The data marts are star schemas but have elements of normalization
Modelling Is Not Straightforward
[Diagram: a Donation fact surrounded by candidate dimensions – Member, Time, Campaign, Location, and member attributes such as Marital Status, Age, Gender, and Income]
Should These Be Combined?
[Same diagram, asking whether member attributes such as Marital Status, Age, Gender, and Income should be combined into the Member dimension rather than modelled as separate dimensions]
Behind The Scenes
There are several aspects of a design that users don’t directly see:
• Meta Data
• Error Correction
• Audit
• Load Control (if not using a scheduling tool)
• Transformation Tables (used for transforming the data prior to being loaded into the DW)
Behind The Scenes
[Diagram: the Error Correction, Audit, Load Control, Transform Tables, and Meta Data structures sit behind the scenes, alongside the flow from Source OLTP Systems through the Staging Area and Data Warehouse to the Data Marts]
A 10 Step Design Process
1. Identify major subject areas or topics
2. Declare the grain
3. Add the element of time to the tables
4. Create appropriate names for tables, columns, and views
5. Add derived fields where applicable
6. Add administrative fields
7. Consider security and privacy in the design
8. Make sure the data model answers the critical business questions
9. Consider meta data
10. Consider error correction
And throughout, performance considerations: Tune, Tune, Tune
Independent of Approach…
…the goal of the data model is to satisfy two primary criteria:
1. Meet business objectives
2. Provide good performance
Staging Area
[Architecture diagram repeated, highlighting the staging area between the Source OLTP Systems and the Data Warehouse]
The Staging Area
• Holds a mirror copy of the extract files (a sketch appears below)
• Allows pre-processing of the data before loading
• Allows easier reloading (you WILL do this)
• Keeps more control with the DW team, rather than with an external group (the extract team)
The Staging Area (continued)
• Facilitates easier audit processes
• Can facilitate error correction processes
• Helps identify the record type (which translates into easier ETL processing and logic)
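As an illustration of the mirror-copy idea, the sketch below shows a staging table for a hypothetical donation extract (all table and column names are invented for the example). Keeping every column as raw text means a bad source value can never fail the load:

```sql
-- Hypothetical staging table: mirrors the extract file with no transformation.
-- Everything is loaded as text so dirty source values never break the load.
CREATE TABLE stg_donation_extract (
  member_no      VARCHAR2(20),    -- kept as raw text; validated later
  campaign_cd    VARCHAR2(10),
  donation_dt    VARCHAR2(20),    -- parsed into a DATE during transformation
  amount         VARCHAR2(20),
  extract_file   VARCHAR2(100),   -- which extract file the row came from (audit)
  load_batch_id  NUMBER           -- ties the row to a load-control record
);
```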
Surrogate Keys
• A surrogate key is a system-generated, non-intelligent (it carries no business meaning), single-column, unique identifier for each row within a table (see the sketch below)
• Always use surrogate keys for dimensions
• Always use surrogate keys for the time dimension
• Always use surrogate keys for transformation tables
• Always use surrogate keys for EVERY table
• …and this includes FACT tables
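A minimal sketch of a sequence-generated surrogate key on a dimension, assuming Oracle sequences (member_dim, member_seq, and the column names are invented for the example):

```sql
-- The surrogate key carries no business meaning; the natural key from the
-- source system is kept as an ordinary attribute for lookups and audit.
CREATE SEQUENCE member_seq;

CREATE TABLE member_dim (
  member_key   NUMBER PRIMARY KEY,   -- surrogate key
  member_no    VARCHAR2(20),         -- natural (business) key from the source
  member_name  VARCHAR2(100),
  gender       VARCHAR2(1)
);

INSERT INTO member_dim (member_key, member_no, member_name, gender)
VALUES (member_seq.NEXTVAL, 'M10042', 'J. Smith', 'F');
```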
Surrogate Keys Avoid…
• Duplicate keys from different source systems
• Recycling of primary keys
• Use of the same key for different business rows
• Lengthy composite key joins
• Wasted space in fact tables
• Exposure to application changes or upgrades in source systems
Using Surrogates In Fact Tables
• You will need a surrogate key on the fact table if you allow ‘unknown’ values into the fact table (which is recommended, by the way)
• The primary key of a fact table is typically the combination of the base dimension keys
Surrogates In Fact Tables
When multiple fact rows point at the same ‘unknown’ dimension members, the combination of dimension keys is no longer unique – this results in a duplicate primary key in the table. Thus the need for a surrogate primary key on the fact table.
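To make the duplicate-key problem concrete, here is a sketch using the invented schema from above, with key 0 reserved as the ‘unknown’ row in each dimension (the reserved-zero convention is an assumption, not something prescribed in the original):

```sql
CREATE TABLE donation_fact (
  donation_key  NUMBER PRIMARY KEY,  -- surrogate key keeps 'unknown' rows unique
  member_key    NUMBER NOT NULL,     -- 0 = unknown member
  campaign_key  NUMBER NOT NULL,     -- 0 = unknown campaign
  time_key      NUMBER NOT NULL,     -- 0 = unknown date
  amount        NUMBER(12,2)
);

-- Without donation_key, these two rows would collide on the composite
-- primary key (member_key, campaign_key, time_key) = (0, 0, 0).
INSERT INTO donation_fact VALUES (1, 0, 0, 0, 25.00);
INSERT INTO donation_fact VALUES (2, 0, 0, 0, 40.00);
```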
Load “Dirty” Data Into The Fact
• Ties out to source systems
• Gains credibility with end users
• Requires a few design resolutions:
  • Bad & Missing (BAM) logic (a lookup sketch follows this list)
  • Surrogate keys in the fact tables
• Still 100% accurate – we don’t load the bad values, we identify the bad values for correction later
• Empowers the end users to decide if the “dirty” data will invalidate their analysis
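One common way to implement this – an assumption here, not necessarily the author’s exact approach – is an outer-join lookup that substitutes the reserved ‘unknown’ key whenever a dimension match fails, so every source row still loads and the totals still tie out:

```sql
-- Every staged row loads; failed lookups fall back to the 'unknown' key 0.
-- donation_seq, campaign_dim, and time_dim are assumed names; a bad amount
-- would be screened by the BAM rules before this step.
INSERT INTO donation_fact (donation_key, member_key, campaign_key, time_key, amount)
SELECT donation_seq.NEXTVAL,
       NVL(m.member_key, 0),       -- unknown member
       NVL(c.campaign_key, 0),     -- unknown campaign
       NVL(t.time_key, 0),         -- unknown date
       TO_NUMBER(s.amount)
FROM   stg_donation_extract s
       LEFT JOIN member_dim   m ON m.member_no   = s.member_no
       LEFT JOIN campaign_dim c ON c.campaign_cd = s.campaign_cd
       LEFT JOIN time_dim     t ON t.cal_date    = TO_DATE(s.donation_dt, 'YYYY-MM-DD');
```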
Tracking History in Dimensions
• Type 1 – No history
• Type 2 – All history
• Type 3 – Some history
More Dimension Types…Combinations
• Type 3 Prime – Types 1 and 2 (the most common)
• Type 4 – Types 1 and 3
• Type 5 – Types 2 & 3
• Type 6 – Types 1, 2, and 3 (the second most common)
A sketch of the two base behaviours follows.
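As a concrete illustration of the two base behaviours (using the invented member dimension, now assumed to carry effective-date columns): a Type 1 change overwrites in place, while a Type 2 change expires the current row and inserts a new version:

```sql
-- Type 1: overwrite in place – no history kept for a corrected name.
UPDATE member_dim
SET    member_name = 'Jane Smith'
WHERE  member_no = 'M10042';

-- Type 2: preserve history – expire the current row and insert a new version
-- when a tracked attribute (here marital_status) changes.
UPDATE member_dim
SET    effective_end_dt = DATE '2005-03-31'
WHERE  member_no = 'M10042' AND effective_end_dt IS NULL;

INSERT INTO member_dim (member_key, member_no, member_name, marital_status,
                        effective_start_dt, effective_end_dt)
VALUES (member_seq.NEXTVAL, 'M10042', 'Jane Smith', 'Married',
        DATE '2005-04-01', NULL);
```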
Trigger Fields
• Trigger fields are the fields within a table for which you want to track history
• Non-trigger fields are those for which you do not want to track history
Type 3 Prime – All and No History
[Example table: trigger fields are tracked with Type 2 versioning, while non-trigger fields are simply overwritten in place; a classification sketch follows]
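A sketch of how a load might classify incoming rows against the current dimension version – trigger changes get a new Type 2 row, non-trigger changes are overwritten in place (stg_member_extract is invented, and marital_status is assumed to be the trigger field):

```sql
-- Compare each staged member to its current dimension row and decide how
-- the change should be applied downstream.
SELECT s.member_no,
       CASE
         WHEN d.member_no IS NULL                  THEN 'NEW MEMBER'
         WHEN s.marital_status <> d.marital_status THEN 'TRIGGER CHANGE'     -- new Type 2 row
         WHEN s.member_name    <> d.member_name    THEN 'NON-TRIGGER CHANGE' -- Type 1 overwrite
         ELSE 'NO CHANGE'
       END AS change_type
FROM   stg_member_extract s
       LEFT JOIN member_dim d
              ON d.member_no = s.member_no
             AND d.effective_end_dt IS NULL;   -- current version only
```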
Expect To Track Everything
• Users want to view the data as it was when the transaction or event occurred, AND…
• Users want to view the data in the context of today’s realities
THUS, model for both!
Add ‘Current’ Columns
• In order to provide these two views, consider adding ‘current’ columns to tables – a special Type 6
• These fields get updated in the historical records when a trigger field changes value in the current record
• This simplifies the use of the DW for the users – it’s easier to understand than having to write complex SQL (sketched below)
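A sketch of the idea with the invented member dimension (current_marital_status is an assumed column): each historical row keeps the value as of that version, while the ‘current’ column is refreshed across all versions whenever the trigger field changes:

```sql
-- When the trigger field changes, push the new value back into every
-- historical row's 'current' column (the Type 2 versioning still happens
-- as in the earlier example).
UPDATE member_dim
SET    current_marital_status = 'Married'
WHERE  member_no = 'M10042';

-- Users can then ask either question without complex SQL:
--   as it was then:   SELECT marital_status         FROM member_dim ...
--   as it is today:   SELECT current_marital_status FROM member_dim ...
```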
Most Recent Flag
• Tracks the most recent record in time (not the most recently loaded, but based on the time series)
• Should be added to the dimensions as a Yes/No (1/0) field
• The most recent record is set to Yes; all other records are set to No
• Allows users to restrict on the Most Recent Flag to get a view of the world today (sketched below)
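A minimal sketch of maintaining the flag, assuming an open-ended effective_end_dt marks the most recent version (most_recent_flag is an assumed column name):

```sql
-- Re-derive the flag from the effective dates after each load.
UPDATE member_dim
SET    most_recent_flag = CASE WHEN effective_end_dt IS NULL THEN 1 ELSE 0 END;

-- A "view of the world today" then needs no date logic at all:
SELECT * FROM member_dim WHERE most_recent_flag = 1;
```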
Double Keying Type 2 Dimensions
• Use a double surrogate key in Type 2 dimensions:
  • One key is unique for each individual row
  • One key is unique for each individual business key
• Protects against authoritative source system changes / duplication (a DDL sketch follows)
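A DDL sketch of a double-keyed Type 2 dimension (names invented): member_key is unique per row version, while member_id stays constant across every version of the same business entity:

```sql
-- If the source system ever changes, recycles, or duplicates member_no,
-- member_id still ties all versions of the same member together.
CREATE TABLE member_dim (
  member_key          NUMBER PRIMARY KEY,  -- unique for each individual row (version)
  member_id           NUMBER NOT NULL,     -- unique for each business key
  member_no           VARCHAR2(20),        -- natural key from the source system
  marital_status      VARCHAR2(10),
  effective_start_dt  DATE,
  effective_end_dt    DATE
);
```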
Rapidly Changing Dimensions
• Rapidly Changing Dimensions (RCDs) need to be partitioned
• Use Oracle partitioning (a sketch follows this list)
• Include the native partition key in the dimension
• Or split into several tables
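A minimal sketch of the Oracle partitioning option, range-partitioning a rapidly changing dimension by its effective start date (the scheme and names are assumptions, not a prescription):

```sql
-- Range partitioning keeps each year's versions physically together:
-- historical partitions stay static while current ones absorb the churn.
CREATE TABLE customer_dim (
  customer_key        NUMBER PRIMARY KEY,
  customer_no         VARCHAR2(20),
  credit_score_band   VARCHAR2(10),   -- the rapidly changing attribute
  effective_start_dt  DATE NOT NULL   -- the native partition key, kept in the dimension
)
PARTITION BY RANGE (effective_start_dt) (
  PARTITION p2003 VALUES LESS THAN (DATE '2004-01-01'),
  PARTITION p2004 VALUES LESS THAN (DATE '2005-01-01'),
  PARTITION p2005 VALUES LESS THAN (DATE '2006-01-01')
);
```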
Bad & Missing Fact Data
• Bad and/or missing data will always be an issue
• The source data is never completely clean
• There are always exceptions
• Recall that you need to tie back to the source systems for your audit, thus you must load this ‘incorrect’ data
• Put the decisions into the hands of your users – don’t decide for them whether the data is good enough or not
• Need to develop Bad & Missing (BAM) rules (a sketch follows)
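One way to make BAM rules concrete – an illustrative assumption, not the author’s documented design – is to classify and log each failed value during the load, while the fact row itself still loads with ‘unknown’ keys as sketched earlier:

```sql
-- Each rejected or suspect value is classified and logged so it can be
-- corrected and reprocessed later (bam_seq is an assumed sequence).
CREATE TABLE bam_log (
  bam_id         NUMBER PRIMARY KEY,
  load_batch_id  NUMBER,
  source_table   VARCHAR2(30),
  source_column  VARCHAR2(30),
  bad_value      VARCHAR2(100),
  bam_class      VARCHAR2(20),   -- e.g. 'MISSING', 'BAD', 'OUT_OF_RANGE'
  logged_dt      DATE DEFAULT SYSDATE
);

INSERT INTO bam_log (bam_id, load_batch_id, source_table, source_column,
                     bad_value, bam_class)
VALUES (bam_seq.NEXTVAL, 42, 'STG_DONATION_EXTRACT', 'MEMBER_NO',
        'XX-UNKNOWN', 'BAD');
```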