380 likes | 759 Views
Surrogate Keys & Changing Dimensions. Prof. Navneet Goyal Computer Science Department BITS, Pilani. Lecture Objectives. Surrogate keys Advantages Generation Changing dimensions Why we need to handle them? Role of surrogate keys in handling changing dimensions. OLTP – Natural Keys.
E N D
Surrogate Keys & Changing Dimensions Prof. Navneet Goyal Computer Science Department BITS, Pilani
Lecture Objectives • Surrogate keys • Advantages • Generation • Changing dimensions • Why we need to handle them? • Role of surrogate keys in handling changing dimensions Dr. Navneet Goyal, BITS Pilani
OLTP – Natural Keys • Production Keys • Intelligent Keys • Smart Keys NKs tell us something about the record they represent For eg. Student IDNO 2003B4A7290 Dr. Navneet Goyal, BITS Pilani
DW - Surrogate Keys • Integer keys • Artificial Keys • Non-intelligent Keys • Meaningless Keys SKs do not tell us anything about the record they represent Dr. Navneet Goyal, BITS Pilani
Surrogate Keys - Advantages • Buffers the DW from operational changes • Saves Space • Faster Joins • Allows proper handling of changing dimensions Dr. Navneet Goyal, BITS Pilani
Surrogate Keys - Advantages Space Saving • Surrogate Keys are integers • 4 bytes of space • Are 4 bytes enough? • Nearly 4 billion values!!! • For example • Date data type occupies 8 bytes • 10 million records in fact table • Space saving=4x10million bytes =38.15 MB Dr. Navneet Goyal, BITS Pilani
Surrogate Keys - Advantages Faster Joins • Every join between dimension table and fact table is based on SKs and not on NKs • Which is faster? • Comparing 2 strings • Comparing 2 integers • But the issue is – Do we need joins in the first place? Dr. Navneet Goyal, BITS Pilani
Surrogate Keys - Advantages Buffering DW from operational changes • Production keys are often reused • For Eg. Inactive account numbers or obsolete product codes are reassigned after a period of dormancy • Not a problem in operational system, but can cause problems in a DW • SKs allow the DW to differentiate between the two instances of the same production key Dr. Navneet Goyal, BITS Pilani
Surrogate Keys - Advantages Handling Changing Dimensions • What are changing dimensions? • Why we need to handle them? • How we can handle them? Dr. Navneet Goyal, BITS Pilani
Changing Dimensions • Slowly Changing Dimensions • Rapidly Changing Dimensions • Small Dimensions • Monster Dimensions Dr. Navneet Goyal, BITS Pilani
Slowly Changing Dimensions • Type I Change (overwrite) • Type II Change (new record) • Type III Change (new attribute) • Hybrid Approach • Predictable changes with multiple version overlays • Unpredictable changes with single version overlays Dr. Navneet Goyal, BITS Pilani
Slowly Changing Dimensions Type I Change • Overwrite Old Value • Used in cases where old values has no significance • Error correction • Spelling error • Example 244807 Dr. Navneet Goyal, BITS Pilani
Slowly Changing Dimensions Type I Change • Fast & Easy to implement • Attribute value always reflect the latest assignment • No history of prior attribute values • In DW environment, can we afford to do that? • NO!!! Dr. Navneet Goyal, BITS Pilani
Slowly Changing Dimensions Type I Change: PROBLEMS • Example SK Description Department NK • History of attribute changes is lost • If there is any increase in sale of the product, the management would not know the reason • Aggregates over department have to be rebuilt Dr. Navneet Goyal, BITS Pilani
Slowly Changing Dimensions Type II Change • Add a dimension row • Keeps track of history • New record approach • Cannot be implemented without the help of SKs!! Dr. Navneet Goyal, BITS Pilani
…. 1234567 Navneet Goyal Single CUST11111 …. 1234567 1234600 Navneet Goyal Married CUST11111 ….. 1234600 1234600 Slowly Changing Dimensions Type II Change: Examples Fact Table Dimension Table 1234567 1234567 SK Product Department NK Dr. Navneet Goyal, BITS Pilani
Slowly Changing Dimensions Type II Change: Advantages • Automatically partitions history in the fact table • Customer profile is easily differentiated • Tracks as many dimension changes as required • No need to rebuild aggregates Dr. Navneet Goyal, BITS Pilani
Slowly Changing Dimensions Type II Change: Disadvantages • Dimension table can become big • Does not allow association of the new attribute value with old fact history & vice-versa • When we constraint on Dept=Strategy, we will not see Intellikidz1 facts from before the change date Dr. Navneet Goyal, BITS Pilani
Slowly Changing Dimensions Type III Change: • Add a dimension column • Alternate Reality • Both current & prior values can be regarded as true at the same time • New and historical fact data can be seen either with the new or prior attribute values • Not used very often Dr. Navneet Goyal, BITS Pilani
Slowly Changing Dimensions Type III Change: Example SK Product Department NK SK Product Old_Dept New_Dept NK Dr. Navneet Goyal, BITS Pilani
Slowly Changing Dimensions Type III Change: Problems • Good for handling predictable changes • Can lead to lot of wastage of space • Myriad of unpredictable changes • Cannot track the impact of numerous intermediate attribute values Dr. Navneet Goyal, BITS Pilani
Generating Surrogate Keys • Initial Load • Straight forward • Subsequent Load • Relatively complex Dr. Navneet Goyal, BITS Pilani
Generating Surrogate Keys Initial Load Deduplication has been done in DSA Production Key Surrogate Key Prod 1 1 Prod 2 2 Prod 3 3 and so on.. Dr. Navneet Goyal, BITS Pilani
Generating Surrogate Keys Subsequent Load • Data Warehouse Refresh • New data has to be brought into the DW • Old data is to be archived • Rolling window • Every incoming NK is to be compared with the existing NK in dimension table • If it does not exit – simply assign a new SK • If it exists – do field by field comparison to see if any attributes have changed • If no change – simply ignore it • If any change – assign new SK (Type 2 change) Dr. Navneet Goyal, BITS Pilani
Generating Surrogate Keys Figure taken from Kimball’s article The original loading of a dimension. Surrogate keys are just assigned sequentially to every input record. The original production key becomes an ordinary attribute. Dr. Navneet Goyal, BITS Pilani
Production Key Surrogate Key Prod1 1 Prod2 2 Prod3 3 Prod4 4 Prod5 5 Generating Surrogate Keys Subsequent Load: LOOK UP Tables • Dimension tables may have hundreds of attributes • Only a few records can be loaded in memory for comparison • LOOK UP tables contain a mapping between NK & SK Dr. Navneet Goyal, BITS Pilani
Production Key Surrogate Key Prod1 6 Prod2 2 Prod3 3 Prod4 4 Prod5 5 Lookup Tables: Advantages • Makes generation of SKs faster • Can be indexed suitably for further speeding up the process • Refreshing of Dimension Tables speeds up • Populating Fact Tables becomes faster • Always points to the latest dimension record Dr. Navneet Goyal, BITS Pilani
Order of Load • Dimension tables • Fact Tables • Fact Table Loading: • In the FT record, simply replace the natural key with the surrogate key Dr. Navneet Goyal, BITS Pilani
Lookup Tables Figure taken from Kimball’s article The lookup table for a typical dimension. There are as many rows as there are unique production keys. The second column is the currently in-force surrogate key used with each production key. Dr. Navneet Goyal, BITS Pilani
Figure taken from Kimball’s article Dimension processing logic for all refreshes of a dimension table after the original load. . Dr. Navneet Goyal, BITS Pilani
Figure taken from Kimball’s article The pipelined, multithreaded fact table processing logic for replacing all production keys (designated here as IDs) with current surrogate keys. Dr. Navneet Goyal, BITS Pilani
Sparsity Failure Dr. Navneet Goyal, BITS Pilani